<h1><center> Foursquare Location Matching </center></h1>
<h2><center> Exploratory Data Analysis </center></h2>
<h2><center> Sugata Ghosh </center></h2>

### Contents

- [Introduction](#1.-Introduction)
- [Basic Data Exploration](#2.-Basic-Data-Exploration)
- [Univariate Analysis - Training Set](#3.-Univariate-Analysis---Training-Set)
- [Multivariate Analysis - Training Set](#4.-Multivariate-Analysis---Training-Set)
- [Univariate Analysis - Pairs Set](#5.-Univariate-Analysis---Pairs-Set)
- [Multivariate Analysis - Pairs Set](#6.-Multivariate-Analysis---Pairs-Set)
- [Acknowledgements](#Acknowledgements)
- [References](#References)

### Importing libraries

In [None]:
# File system and process management
import time, psutil, os, gc

# Mathematical functions
import math

# Data manipulation
import numpy as np
import pandas as pd

# Plotting and visualization
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
sns.set_theme()
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# Others
import operator as op
from functools import reduce, lru_cache

# Suppress or allow warnings
import warnings
warnings.filterwarnings("ignore")

### Runtime and memory usage

In [None]:
# Recording the starting time, complemented with a stopping time check in the end to compute process runtime
start = time.time()

# Class representing the OS process and having memory_info() method to compute process memory usage
process = psutil.Process(os.getpid())

# 1. Introduction

- [Point of Interest (POI)](#Point-of-Interest-(POI))
- [The Problem of POI Matching](#The-Problem-of-POI-Matching)
- [About Foursquare](#About-Foursquare)
- [Data](#Data)
- [Project Objective](#Project-Objective)
- [Evaluation Metric](#Evaluation-Metric)

## Point of Interest (POI)

A [point of interest](https://en.wikipedia.org/wiki/Point_of_interest) (POI) is a specific point location that someone may find useful or interesting. An example is a point on the Earth representing the location of the Eiffel Tower, or a point on Mars representing the location of its highest mountain, [Olympus Mons](https://en.wikipedia.org/wiki/Olympus_Mons). Most consumers use the term when referring to hotels, campsites, fuel stations or any other categories used in modern automotive navigation systems. Users of a mobile device can be provided with a geolocation- and time-aware POI service that recommends nearby geolocations with temporal relevance (e.g. POIs for special services in a ski resort are available only in winter). The notion of POI is widely used in cartography, especially in electronic variants including GIS and GPS navigation software.

## The Problem of POI Matching

It is useful to combine POI data obtained from multiple sources for effective reusability. One issue in merging such data is that different datasets may record the same POI with variations in name, address, and other identifying information. It is thus important to identify observations which refer to the same POI. The process of POI matching involves finding POI pairs that refer to the same real-world entity; this is the core issue in geospatial data integration and is perhaps the most technically difficult part of multi-source POI fusion. Raw location data can contain noise, unstructured information, and incomplete or inaccurate attributes, which makes the task even more difficult. Nonetheless, to maintain the highest level of accuracy, the data must be matched, and duplicate POIs must be identified and merged with timely updates from multiple sources. A combination of machine-learning algorithms and rigorous human validation methods is optimal for effective de-duplication of such data.

## About Foursquare

[Foursquare Labs Inc.](https://foursquare.com/), commonly known as Foursquare, is an American location technology company and data cloud platform. The company's location platform is the foundation of several business and consumer products, including the [Foursquare City Guide](https://en.wikipedia.org/wiki/Foursquare_City_Guide) and [Foursquare Swarm](https://en.wikipedia.org/wiki/Foursquare_Swarm) apps. Foursquare's products include Pilgrim SDK, Places, Visits, Attribution, Audience, Proximity, and Unfolded Studio. It is one of the leading independent providers of global POI data and is dedicated to building meaningful bridges between digital spaces and physical places. Trusted by leading enterprises like Apple, Microsoft, Samsung, and Uber, Foursquare's tech stack harnesses the power of places and movement to improve customer experiences and drive better business outcomes.

## Data

**Source:** https://www.kaggle.com/competitions/foursquare-location-matching/data

The data considered in the competition comprises over one-and-a-half million place entries for hundreds of thousands of commercial Points-of-Interest (POIs) around the globe. Though the data entries may represent or resemble entries for real places, they may be contaminated with artificial information or additional noise.

The training data comprises eleven attribute fields for over one million place entries, together with:
- `id` : A unique identifier for each entry.
- `point_of_interest` : An identifier for the POI the entry represents. There may be one or many entries describing the same POI. Two entries *match* when they describe a common POI.

In [None]:
# Loading the training data
data_train = pd.read_csv('../input/foursquare-location-matching/train.csv')
print(pd.Series({"Memory usage": "{:.2f} MB".format(data_train.memory_usage().sum()/(1024*1024)),
                 "Dataset shape": "{}".format(data_train.shape)}).to_string())
print(" ")
data_train.head()

In [None]:
# A typical observation from the training set
data_train.iloc[0]

The pairs data is a pregenerated set of pairs of place entries from the training data designed to improve detection of matches. It includes:
- `match` : A Boolean variable denoting whether or not the pair of entries describes a common POI.

In [None]:
# Loading pregenerated set of pairs of place entries from the training data
data_pairs = pd.read_csv('../input/foursquare-location-matching/pairs.csv')
print(pd.Series({"Memory usage": "{:.2f} MB".format(data_pairs.memory_usage().sum()/(1024*1024)),
                 "Dataset shape": "{}".format(data_pairs.shape)}).to_string())
print(" ")
data_pairs.head()

In [None]:
# A typical observation from the pregenerated set of pairs
data_pairs.iloc[0]

The test data comprises a set of place entries with their recorded attribute fields, similar to the training set. The POIs in the test data are distinct from the POIs in the training data.

In [None]:
# Loading the test data
data_test = pd.read_csv('../input/foursquare-location-matching/test.csv')
print(pd.Series({"Memory usage": "{:.5f} MB".format(data_test.memory_usage().sum()/(1024*1024)),
                 "Dataset shape": "{}".format(data_test.shape)}).to_string())
print(" ")
data_test.head()

In [None]:
# A typical observation from the test set
data_test.iloc[0]

## Project Objective

The goal of the project is to match POIs together. Using the provided dataset of over one-and-a-half million place entries, heavily altered to include noise, duplications, and extraneous or incorrect information, the objective is to produce an algorithm that predicts which place entries represent the same POI. Each place entry in the data includes useful attributes like name, street address, and coordinates. Efficient and successful matching of POIs will make it easier to identify where new stores or businesses would benefit people the most.

## Evaluation Metric

**[Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index).** Also known as the *Jaccard similarity coefficient*, it is a statistic used for gauging the similarity and diversity of sample sets. It was developed by [Grove Karl Gilbert](https://en.wikipedia.org/wiki/Grove_Karl_Gilbert) in 1884 as his *ratio of verification (v)* and is now frequently referred to as the *Critical Success Index* in meteorology. It was later developed independently by [Paul Jaccard](https://en.wikipedia.org/wiki/Paul_Jaccard), who originally gave it the French name *coefficient de communauté*, and independently formulated again by T. T. Tanimoto. Thus, the names *Tanimoto index* and *Tanimoto coefficient* are also used in some fields. All of these, however, take the same general form: the ratio of intersection over union. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

$$ J(A, B) := \frac{\left\vert A \cap B \right\vert}{\left\vert A \cup B \right\vert} = \frac{\left\vert A \cap B \right\vert}{\left\vert A \right\vert + \left\vert B \right\vert - \left\vert A \cap B \right\vert}. $$

Note that by design, $0\leq J\left(A, B\right)\leq 1$. If $A$ and $B$ are both empty, define $J(A, B) = 1$. The Jaccard coefficient is widely used in computer science, ecology, genomics, and other sciences, where binary or binarized data are used. Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard coefficient. See [this paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3118-5) ([arxiv version](https://arxiv.org/abs/1903.11372)) for details.

Let us assume that for a specific `id` $a$, our algorithm produces three matches $a$, $b$ and $c$ whereas the true matches are $a$, $b$, $d$ and $e$. Then the Jaccard index for the prediction on this particular `id` will be

$$ \frac{\left\vert \left\{a, b, c\right\} \cap \left\{a, b, d, e\right\} \right\vert}{\left\vert \left\{a, b, c\right\} \cup \left\{a, b, d, e\right\} \right\vert} = \frac{\left\vert \left\{a, b\right\} \right\vert}{\left\vert \left\{a, b, c, d, e\right\} \right\vert} = \frac{2}{5}. $$

Thus, while correct matching predictions are rewarded, incorrect matching predictions are penalised by equal measure. The evaluation metric is simply the mean of Jaccard indices for each of the test observations, i.e. if the test data comprises $n_{\text{test}}$ observations and $J_i$ denotes the Jaccard index corresponding to the $i$th test observation, $i = 1,2,\cdots,n_{\text{test}}$, then the final metric by which a model will be evaluated is:

$$ \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} J_i. $$
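The metric above can be sketched in a few lines of Python. This is a minimal illustration rather than the competition's official scoring code; the function names `jaccard_index` and `mean_jaccard`, and the choice of dictionaries mapping each `id` to its matches, are assumptions made for this example.

```python
def jaccard_index(predicted, actual):
    """Jaccard index of two collections of ids, with J = 1 when both are empty."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        return 1.0
    return len(predicted & actual) / len(union)


def mean_jaccard(predictions, truths):
    """Mean of the per-id Jaccard indices.

    Both arguments map an id to the collection of ids it matches
    (hypothetical layout chosen for this sketch). An id missing from
    `predictions` is treated as having no predicted matches.
    """
    scores = [jaccard_index(predictions.get(key, ()), matches)
              for key, matches in truths.items()]
    return sum(scores) / len(scores)


# Worked example from the text: predicted {a, b, c} against true {a, b, d, e}
score = jaccard_index({"a", "b", "c"}, {"a", "b", "d", "e"})
print(score)  # 0.4
```

Because the union appears in the denominator, both a missed true match and a spurious predicted match lower the score for that `id`.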

# 2. Basic Data Exploration

In [None]:
# Shape of the data
print(pd.Series({"Shape of the training set": data_train.shape,
                 "Shape of the pregenerated set of pairs": data_pairs.shape,
                 "Shape of the test set": data_test.shape}).to_string())

In [None]:
# Count of observations in the data
print(pd.Series({"Number of observations in the training set": len(data_train),
                 "Number of observations in the pregenerated set of pairs": len(data_pairs),
                 "Number of observations in the test set": len(data_test)}).to_string())

It is evident that the provided test set has too few observations to be of any practical use; it only serves as a sample of the far larger test data used to evaluate submitted predictions in [the competition](https://www.kaggle.com/competitions/foursquare-location-matching). We shall therefore have to split the training set and use part of it for testing purposes.
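One way to carve a hold-out set from the training data is to split by POI rather than by row, so that all entries of a POI land on the same side and no match straddles the split. The sketch below is only an illustration under that assumption: the helper `split_by_poi`, its parameters, and the toy frame are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd


def split_by_poi(data, test_fraction=0.2, seed=0):
    """Split rows so that every point_of_interest lands entirely on one side."""
    rng = np.random.default_rng(seed)
    # Shuffle the distinct POIs, then reserve a fraction of them for testing
    pois = rng.permutation(data["point_of_interest"].unique())
    n_test = max(1, int(round(test_fraction * len(pois))))
    is_test = data["point_of_interest"].isin(list(pois[:n_test]))
    return data[~is_test], data[is_test]


# Toy data with made-up identifiers, standing in for data_train
toy = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4", "E_5", "E_6"],
    "point_of_interest": ["P_a", "P_a", "P_b", "P_c", "P_c", "P_d"],
})
part_train, part_test = split_by_poi(toy, test_fraction=0.25)
print(part_test)
```

This mirrors the statement in the data description that the POIs in the test data are distinct from those in the training data.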

In [None]:
# Count of columns in the data
print(pd.Series({"Number of columns in the training set": len(data_train.columns),
                 "Number of columns in the pregenerated set of pairs": len(data_pairs.columns),
                 "Number of columns in the test set": len(data_test.columns)}).to_string())

In [None]:
# Column names for the training set
data_train.columns.tolist()

In [None]:
# Column names for the pregenerated set of pairs
data_pairs.columns.tolist()

In [None]:
# Column names for the test set
data_test.columns.tolist()

In [None]:
# Columns in the training set which are not in the test set
[col for col in data_train.columns if col not in data_test.columns]

In [None]:
# Column datatypes for the training set
data_train.dtypes

In [None]:
# Count of column datatypes for the training set
print(pd.Series({"Number of integer columns": len(data_train.columns[data_train.dtypes == 'int64']),
                 "Number of float columns": len(data_train.columns[data_train.dtypes == 'float64']),
                 "Number of object columns": len(data_train.columns[data_train.dtypes == 'object'])}).to_string())

In [None]:
# Column datatypes for the pregenerated set of pairs
data_pairs.dtypes

In [None]:
# Count of column datatypes for the pregenerated set of pairs
print(pd.Series({"Number of integer columns": len(data_pairs.columns[data_pairs.dtypes == 'int64']),
                 "Number of float columns": len(data_pairs.columns[data_pairs.dtypes == 'float64']),
                 "Number of object columns": len(data_pairs.columns[data_pairs.dtypes == 'object']),
                 "Number of Boolean columns": len(data_pairs.columns[data_pairs.dtypes == 'bool'])}).to_string())

In [None]:
# Column datatypes for the test set
data_test.dtypes

In [None]:
# Count of column datatypes for the test set
print(pd.Series({"Number of integer columns": len(data_test.columns[data_test.dtypes == 'int64']),
                 "Number of float columns": len(data_test.columns[data_test.dtypes == 'float64']),
                 "Number of object columns": len(data_test.columns[data_test.dtypes == 'object'])}).to_string())

The columns of the test set are exactly the first $12$ columns of the training set, i.e. all columns except `point_of_interest`. The pairs set contains these $12$ columns twice, once for each of the two observations in a pair, plus the Boolean column `match`, which indicates whether or not the two observations refer to the same real-world entity. Note that the columns `zip` and `phone` have `object` datatype in the training set and `float` datatype in the test set. A likely reason is that the provided test set has only $5$ observations and hence does not capture the general picture.
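The datatype mismatch for `zip` and `phone` can be reproduced on a tiny example, and one possible remedy is to request text columns explicitly when reading the file. The sketch below uses a made-up in-memory CSV, not the competition files; whether forcing these columns to text is the right choice for a given pipeline is an assumption.

```python
import io

import pandas as pd

# Made-up two-row CSV: numeric-looking values plus a missing entry
csv_text = "id,zip,phone\nE_1,94103,8667332693\nE_2,,\n"

# Default inference: digits with a missing value are read as floats
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit request: keep the two columns as text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str, "phone": str})

print(inferred.dtypes)
print(as_text.dtypes)
```

Reading the columns as text also preserves leading zeros and symbols such as `+`, which a float representation would lose.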

In [None]:
# Number of unique values in the training set columns
data_train.nunique()

In [None]:
# Number of unique values in the training set columns by percentage
plt.figure(figsize = (9, 13 / 3))
data_temp = (data_train.nunique() / len(data_train)) * 100
s = sns.barplot(x = data_temp.values, y = data_temp.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of unique values", ylabel = "column")
plt.tight_layout()
plt.show()

In [None]:
# Number of unique values in the columns of the pregenerated set of pairs
data_pairs.nunique()

In [None]:
# Number of unique values in the columns of the pregenerated set of pairs by percentage
plt.figure(figsize = (9, 25 / 3))
data_temp = (data_pairs.nunique() / len(data_pairs)) * 100
s = sns.barplot(x = data_temp.values, y = data_temp.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of unique values", ylabel = "column")
plt.tight_layout()
plt.show()

In [None]:
# Number of unique values in the test set columns
data_test.nunique()

In [None]:
# Number of unique values in the test set columns by percentage
plt.figure(figsize = (9, 4))
data_temp = (data_test.nunique() / len(data_test)) * 100
s = sns.barplot(x = data_temp.values, y = data_temp.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of unique values", ylabel = "column")
plt.tight_layout()
plt.show()

In [None]:
# Count of duplicate rows
print(pd.Series({"Number of duplicate rows in the training set": data_train.duplicated().sum(),
                 "Number of duplicate rows in the pregenerated set of pairs": data_pairs.duplicated().sum(),
                 "Number of duplicate rows in the test set": data_test.duplicated().sum()}).to_string())

In [None]:
# Constant columns in the training set
cols_constant_train = data_train.columns[data_train.nunique() == 1].tolist()
if len(cols_constant_train) == 0:
    cols_constant_train = "None"
print(pd.Series({"Constant columns in the training set": cols_constant_train}).to_string())

In [None]:
# Constant columns in the pregenerated set of pairs
cols_constant_pairs = data_pairs.columns[data_pairs.nunique() == 1].tolist()
if len(cols_constant_pairs) == 0:
    cols_constant_pairs = "None"
print(pd.Series({"Constant columns in the pregenerated set of pairs": cols_constant_pairs}).to_string())

In [None]:
# Constant columns in the test set
cols_constant_test = data_test.columns[data_test.nunique() == 1].tolist()
if len(cols_constant_test) == 0:
    cols_constant_test = "None"
print(pd.Series({"Constant columns in the test set": cols_constant_test}).to_string())

In [None]:
# Count of columns with missing values
print(pd.Series({"Number of columns with missing values in the training set": len(data_train.isna().sum()[data_train.isna().sum() != 0]),
                 "Number of columns with missing values in the pregenerated set of pairs": len(data_pairs.isna().sum()[data_pairs.isna().sum() != 0]),
                 "Number of columns with missing values in the test set": len(data_test.isna().sum()[data_test.isna().sum() != 0])}).to_string())

In [None]:
# Columns with missing values in the training set with respective proportion of missing values
(data_train.isna().sum()[data_train.isna().sum() != 0]/len(data_train)).sort_values(ascending = False)

In [None]:
# Missing values in the training set
plt.figure(figsize = (9, 13 / 3))
data_temp = (data_train.isna().sum() * 100 / len(data_train)) #.sort_values()
s = sns.barplot(x = data_temp.values, y = data_temp.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of missing values", ylabel = "column")
plt.tight_layout()
plt.show()

Three columns, `zip`, `phone` and `url`, have over $50\%$ of their values missing, while two more columns, `address` and `state`, have over $30\%$ of their values missing.

In [None]:
# Columns with missing values in the pregenerated set of pairs with respective proportion of missing values
(data_pairs.isna().sum()[data_pairs.isna().sum() != 0]/len(data_pairs)).sort_values(ascending = False)

In [None]:
# Missing values in the pregenerated set of pairs
plt.figure(figsize = (9, 25 / 3))
data_pairs_missing = (data_pairs.isna().sum() * 100 / len(data_pairs)) #.sort_values()
s = sns.barplot(x = data_pairs_missing.values, y = data_pairs_missing.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of missing values", ylabel = "column")
plt.tight_layout()
plt.show()

In [None]:
# Columns with missing values in the test set with respective proportion of missing values
(data_test.isna().sum()[data_test.isna().sum() != 0]/len(data_test)).sort_values(ascending = False)

In [None]:
# Missing values in the test set
plt.figure(figsize = (9, 4))
data_test_missing = (data_test.isna().sum() * 100 / len(data_test)) #.sort_values()
s = sns.barplot(x = data_test_missing.values, y = data_test_missing.index)
s.set_xlim(0, 100)
# s.bar_label(s.containers[0])
s.set(xlabel = "% of missing values", ylabel = "column")
plt.tight_layout()
plt.show()

In [None]:
# Statistical description of numerical variables in the training set
data_train.describe()

In [None]:
# Statistical description of categorical variables in the training set
data_train.describe(include = ['O'])

In [None]:
# Statistical description of numerical variables in the pregenerated set of pairs
data_pairs.describe()

In [None]:
# Statistical description of categorical variables in the pregenerated set of pairs
data_pairs.describe(include = ['O'])

In [None]:
# Statistical description of numerical variables in the test set
data_test.describe()

In [None]:
# Statistical description of categorical variables in the test set
data_test.describe(include = ['O'])

**Training set synopsis:**

- Number of observations: $1138812$
- Number of columns: $13$
- Number of integer columns: $0$
- Number of float columns: $2$
- Number of object columns: $11$
- Number of duplicate observations: $0$
- Constant columns: None
- Number of columns with missing values: $9$
- Memory Usage: $112.95$ MB

**Pregenerated set of pairs synopsis:**

- Number of observations: $578907$
- Number of columns: $25$
- Number of integer columns: $0$
- Number of float columns: $4$
- Number of object columns: $20$
- Number of Boolean columns: $1$
- Number of duplicate observations: $0$
- Constant columns: None
- Number of columns with missing values: $16$
- Memory Usage: $106.55$ MB

**Test set synopsis:**

- Number of observations: $5$
- Number of columns: $12$
- Number of integer columns: $0$
- Number of float columns: $4$
- Number of object columns: $8$
- Number of duplicate observations: $0$
- Constant columns: `url`, `phone`
- Number of columns with missing values: $6$
- Memory Usage: $0.00058$ MB

# 3. Univariate Analysis - Training Set

- [Point of Interest](#Point-of-Interest)
- [Latitude and Longitude](#Latitude-and-Longitude)
- [Country](#Country)
- [State](#State)
- [City](#City)
- [ZIP Code](#ZIP-Code)
- [Categories](#Categories)
- [Name](#Name)
- [Address](#Address)
- [Phone](#Phone)
- [URL](#URL)

## Point of Interest

In [None]:
# Horizontal countplot of the most frequent POIs in the training set
cutoff = min(len(data_train['point_of_interest'].value_counts()), 50)
order_descending = data_train.groupby('point_of_interest').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'point_of_interest', order = order_descending)
plt.title(f"Top {cutoff} POIs in the training set with highest frequencies", fontsize = 14)
plt.tight_layout()
plt.show()

We observe that many POIs are represented by a large number of entries, with `P_fb339198a31db3` appearing as many as $332$ times.
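Beyond the top of the list, it can help to summarize how many entries each POI has overall. The following sketch shows one way to do so on a toy series with made-up identifiers; the variable names are hypothetical, and the same calls could be applied to `data_train['point_of_interest']`.

```python
import pandas as pd

# Toy stand-in for data_train['point_of_interest']
toy = pd.Series(["P_a", "P_a", "P_a", "P_b", "P_c", "P_c"],
                name="point_of_interest")

# Number of entries per POI
entries_per_poi = toy.value_counts()

# Share of POIs that have more than one entry, i.e. at least one match
share_with_matches = (entries_per_poi > 1).mean()

# How many POIs have exactly 1, 2, 3, ... entries
size_distribution = entries_per_poi.value_counts().sort_index()

print(share_with_matches)
print(size_distribution)
```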

## Latitude and Longitude

In [None]:
# Histograms of latitude and longitude of observations in the training set
fig, ax = plt.subplots(1, 2, figsize = (15, 6), sharey = True)
sns.histplot(data = data_train, x = 'latitude', bins = 30, ax = ax[0])
sns.histplot(data = data_train, x = 'longitude', bins = 30, ax = ax[1])
ax[1].set_ylabel(" ")
plt.tight_layout()
plt.show()

We observe that a large share of the observations falls inside the latitude interval $30$ to $60$, which covers much of the `United States of America` as well as many countries in `Europe`. There are not many observations below $-50$ and above $70$, as these latitudes correspond to the two polar regions and their surrounding areas, which expectedly do not contain many POIs.

As for the longitude, we can see three separate groups. The first group is from $-125$ to $-25$, which covers `North America` and `South America`. The second group is from $-25$ to $75$, which covers `Europe` and `Africa`. The third and final group is from $75$ to $175$, which covers `Asia` and `Australia`. There are four troughs in the histogram of longitudes. The first and the fourth troughs (at the extreme left and the extreme right, which are joined because the Earth is round) are due to the Pacific Ocean, the second trough is due to the Atlantic Ocean, whereas the third trough is due to the Indian Ocean, as well as the lack of observations from some parts of `Russia` and the western part of `China`. A scatterplot of latitude and longitude, which is given in the next section, provides a clearer picture of the distribution of the locations of the training observations.

Next, we present horizontal countplots of the object-type columns. We begin by presenting the countries with the most training observations. For labelling purposes, we use a JSON file which contains the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) codes for $246$ countries or territories.

## Country

In [None]:
# ISO 3166-1 alpha-2 country codes
url_iso_alpha_2 = "https://raw.githubusercontent.com/sugatagh/Foursquare-Location-Matching/main/JSON/ISO_3166-1_alpha-2.json"
dict_iso_alpha_2 = pd.read_json(url_iso_alpha_2, typ = 'series')

In [None]:
# Horizontal countplot of country of observations in the training set
data_train_country = pd.DataFrame()
data_train_country['country'] = data_train['country'].map(dict_iso_alpha_2)
cutoff = min(len(data_train_country['country'].value_counts()), 50)
order_descending = data_train_country.groupby('country').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train_country, y = 'country', order = order_descending)
plt.title("Top countries with most observations in the training set", fontsize = 14)
plt.tight_layout()
plt.show()

We observe that `United States` has the highest number of training observations by a large margin, followed by `Turkey` and `Indonesia`.

## State

In [None]:
# Horizontal countplot of state of training observations
cutoff = min(len(data_train['state'].value_counts()), 75)
order_descending = data_train.groupby('state').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_train, y = 'state', order = order_descending, ax = ax)
ax.set_title("Top states in the world with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

The state of California (`CA`) comes out on top, followed by New York (`NY`) and Florida (`FL`). The same state is often reported under different names, owing to variation in full name, short name, [postal abbreviations](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations), and letter case. Examples are `CA`, `Calif`, `California`, `Ca`; `NY`, `New York`, `Ny`, `ny`; and `Texas`, `Tx`, `tx`. It may be useful to map the state names to the respective postal abbreviations using a dictionary or otherwise. Also, to get rid of the variation due to capitalization in general, we may convert all the reported object values (not just those in the `state` column) to lowercase.
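The dictionary-based normalization suggested above could look roughly as follows. This is a sketch only: the mapping `US_STATE_ABBREVIATIONS` is deliberately partial and covers just the variants quoted in the text, and the helper name `normalize_state` is hypothetical.

```python
# Partial, illustrative mapping from lowercase state names to lowercase
# postal abbreviations; a real mapping would cover all states
US_STATE_ABBREVIATIONS = {
    "california": "ca",
    "calif": "ca",
    "new york": "ny",
    "texas": "tx",
}


def normalize_state(value):
    """Lowercase a state string and map known names to postal abbreviations."""
    if not isinstance(value, str):
        return value  # leave missing values (e.g. NaN) untouched
    key = value.strip().lower()
    return US_STATE_ABBREVIATIONS.get(key, key)


variants = ["CA", "Calif", "California", "Ca", "NY", "New York", "ny", "Texas", "Tx"]
print(sorted({normalize_state(v) for v in variants}))  # ['ca', 'ny', 'tx']
```

With pandas, such a helper could be applied column-wise, for instance through `Series.map`.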

## City

In [None]:
# Horizontal countplot of city of training observations
cutoff = min(len(data_train['city'].value_counts()), 50)
order_descending = data_train.groupby('city').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_train, y = 'city', order = order_descending, ax = ax)
ax.set_title("Top cities in the world with most observations", fontsize = 14)
plt.tight_layout()
plt.show()

`Singapore` has the most training observations, followed closely by `Mockba` and `Bandung`.

## ZIP Code

In [None]:
# Horizontal countplot of ZIP code of training observations
cutoff = min(len(data_train['zip'].value_counts()), 50)
order_descending = data_train.groupby('zip').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_train, y = 'zip', order = order_descending, ax = ax)
ax.set_title("Top ZIP codes in the world with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

We observe that the ZIP code with the highest number of training observations is `9000`, followed by `10330` and `10110`. Note that postal/ZIP codes are not unique globally: the same code may be used in more than one country, so the plot can be misleading.
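A simple way to avoid conflating identical codes from different countries is to count by the pair (`country`, `zip`) instead of by `zip` alone. The sketch below uses a toy frame whose rows are made up for illustration and do not come from the training set.

```python
import pandas as pd

# Made-up rows: the same code "10110" appears under two different countries
toy = pd.DataFrame({
    "country": ["US", "US", "TH", "TH", "TH", "BE"],
    "zip": ["10110", "94103", "10110", "10110", "10330", "9000"],
})

# Counting by zip alone merges the US and TH entries for "10110"
zip_only_counts = toy.groupby("zip").size()

# Counting by (country, zip) keeps them apart
zip_counts = toy.groupby(["country", "zip"]).size().sort_values(ascending=False)

print(zip_only_counts["10110"])
print(zip_counts)
```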

## Categories

In [None]:
# Horizontal countplot of categories of training observations in the world
cutoff = min(len(data_train['categories'].value_counts()), 50)
order_descending = data_train.groupby('categories').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'categories', order = order_descending)
plt.title("Top categories in the world with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

`Residential Buildings (Apartments / Condos)` has the highest number of training observations, followed by `Banks` and `College classrooms`.

## Name

In [None]:
# Horizontal countplots of name of training observations
cutoff = min(len(data_train['name'].value_counts()), 50)
order_descending = data_train.groupby('name').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'name', order = order_descending)
plt.title("Top names with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

`Starbucks` tops the list with the highest number of training observations, followed by `McDonald's` and `Redbox`.

## Address

In [None]:
# Horizontal countplots of address of training observations
cutoff = min(len(data_train['address'].value_counts()), 50)
order_descending = data_train.groupby('address').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'address', order = order_descending)
plt.title("Top addresses with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

As expected, the most frequent addresses are generic strings that typically form only part of an address, with `Terminal 1` having the highest frequency.

## Phone

In [None]:
# Horizontal countplots of phone numbers of training observations
cutoff = min(len(data_train['phone'].value_counts()), 50)
order_descending = data_train.groupby('phone').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'phone', order = order_descending)
plt.title("Top phone numbers with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

The number `8667332693` leads by far, followed by the same number with the prefix `+1`, which is a country calling code. This shows how the same phone number can vary across records due to the addition or omission of such a prefix. Removing it at the *data preprocessing* stage may help prevent identical phone numbers from being treated as different.
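The phone clean-up described above might be sketched as below. The helper `normalize_phone` and its defaults are assumptions: they reflect ten-digit numbers with calling code `1`, as in the example from the text, and other countries would need their own rules.

```python
import re


def normalize_phone(value, country_code="1", national_length=10):
    """Keep digits only and drop a leading country code when one is present."""
    if not isinstance(value, str):
        return value  # leave missing values (e.g. NaN) untouched
    digits = re.sub(r"\D", "", value)
    # Strip the country code only if the remainder has the expected length
    if (len(digits) == national_length + len(country_code)
            and digits.startswith(country_code)):
        digits = digits[len(country_code):]
    return digits


print(normalize_phone("+18667332693"))    # 8667332693
print(normalize_phone("(866) 733-2693"))  # 8667332693
```

Note that this only works if the column is read as text; a float representation would already have lost the `+` sign and any leading zeros.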

## URL

In [None]:
# Horizontal countplots of URL of training observations
cutoff = min(len(data_train['url'].value_counts()), 50)
order_descending = data_train.groupby('url').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_train, y = 'url', order = order_descending)
plt.title("Top URLs with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

The website `https://www.sej.co.jp` is associated with the highest number of training observations, followed by `http://www.7eleven.co.th` and `http://www.payless.com/`. Observe the possibility of undesired variation due to the strings `https://`, `http://`, `www` or a trailing slash. It may be helpful to get rid of these strings and retain only the *core* part of the URLs at the *data preprocessing* stage.
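The URL clean-up described above can be sketched with a couple of regular expressions. The helper name `normalize_url` is hypothetical, and lowercasing the entire string is a simplification, since URL paths can be case-sensitive.

```python
import re


def normalize_url(value):
    """Strip the scheme, a leading 'www.' and trailing slashes from a URL."""
    if not isinstance(value, str):
        return value  # leave missing values (e.g. NaN) untouched
    core = value.strip().lower()            # simplification: lowercases paths too
    core = re.sub(r"^https?://", "", core)  # drop http:// or https://
    core = re.sub(r"^www\.", "", core)      # drop a leading www.
    return core.rstrip("/")                 # drop trailing slashes


print(normalize_url("https://www.sej.co.jp"))    # sej.co.jp
print(normalize_url("http://www.payless.com/"))  # payless.com
```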

# 4. Multivariate Analysis - Training Set

- [Latitude and Longitude](#Latitude-and-Longitude)
- [States of USA](#States-of-USA)
- [Cities of California, USA](#Cities-of-California,-USA)
- [ZIP codes of San Francisco, California, USA](#ZIP-codes-of-San-Francisco,-California,-USA)
- [Categories of POIs in San Francisco - 94103, California, USA](#Categories-of-POIs-in-San-Francisco---94103,-California,-USA)
- [Names of coffee shops in San Francisco - 94103, California, USA](#Names-of-coffee-shops-in-San-Francisco---94103,-California,-USA)
- [Address, Phone and URL of Starbucks coffee shops in San Francisco - 94103, California, USA](#Address,-Phone-and-URL-of-Starbucks-coffee-shops-in-San-Francisco---94103,-California,-USA)

## Latitude and Longitude

In [None]:
# Scatterplot of latitude and longitude of observations in the training set
plt.figure(figsize = (15, 9))
sns.scatterplot(data = data_train, x = 'longitude', y = 'latitude')
plt.title("Scatterplot of latitude and longitude of training observations", fontsize = 14)
plt.tight_layout()
plt.show()

This scatterplot gives a clearer picture of the locations of POIs in the training data than the histograms of latitude and longitude presented in the previous section, because those captured only marginal information on the location of the observations. The present plot roughly resembles the world map, although there are relatively few POIs from `Canada`, `Australia`, some parts of `Russia` and the countries of `Africa` (apart from the southeast part). Observations are dense and uniform over the `United States of America`, `Mexico`, `New Zealand`, almost the entirety of `India`, the east coast of `South America`, the countries in `Europe` and `Southeast Asia`, as well as the east coast of `Australia`. Next, we present horizontal countplots of the object-type columns in a hierarchical manner:

$$ \text{World} \mapsto \text{USA} \mapsto \text{California} \mapsto \text{San Francisco} \mapsto \text{ZIP}\,\,94103 \mapsto \text{Coffee Shops} \mapsto \text{Starbucks}$$
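
Each step of this drill-down filters the training set on the levels already fixed and then counts the values of the next column. A minimal sketch of that pattern is given here, assuming a hypothetical helper `top_counts` and a made-up toy dataframe; the cells in this section spell the filters out explicitly instead.

```python
import pandas as pd

# Hypothetical helper: filter on exact column values, then count the top values of another column
def top_counts(df, filters, col, n = 50):
    mask = pd.Series(True, index = df.index)
    for key, val in filters.items():
        mask &= (df[key] == val)
    return df.loc[mask, col].value_counts().head(n)

# Toy data (made up for illustration, not taken from the training set)
toy = pd.DataFrame({'country': ['US', 'US', 'US', 'BE'],
                    'state': ['CA', 'CA', 'NY', None],
                    'city': ['San Francisco', 'Los Angeles', 'New York', 'Gent']})
print(top_counts(toy, {'country': 'US'}, 'state'))
print(top_counts(toy, {'country': 'US', 'state': 'CA'}, 'city'))
```

Passing the filters as a dictionary keeps each level of the hierarchy a one-line change.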

## States of USA

In [None]:
# Horizontal countplot of state of training observations located in the United States of America
data_temp = data_train[data_train['country'] == 'US']
cutoff = min(len(data_temp['state'].value_counts()), 50)
order_descending_temp = data_temp.groupby('state').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_temp, y = 'state', order = order_descending_temp, ax = ax)
ax.set_title("Top states in USA with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

In the USA, the state of California (`CA`) comes out on top, followed by New York (`NY`) and Florida (`FL`). In fact, seven of the top ten states in the world are in the USA. Next, we explore the cities of California with the most training observations.

## Cities of California, USA

In [None]:
# Horizontal countplot of city of training observations located in the state of California, USA
data_temp = data_train[(data_train['country'] == 'US') 
                       & (data_train['state'] == 'CA')]
cutoff = min(len(data_temp['city'].value_counts()), 50)
order_descending_temp = data_temp.groupby('city').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_temp, y = 'city', order = order_descending_temp, ax = ax)
ax.set_title("Top cities in California, USA with most observations", fontsize = 14)
plt.tight_layout()
plt.show()

In the state of California, the city of `San Francisco` has the most training observations, followed closely by `Los Angeles` and `San Diego`. Next, we analyze the ZIP codes in `San Francisco` with the most training observations.

## ZIP codes of San Francisco, California, USA

In [None]:
# Horizontal countplot of ZIP code of training observations in the city of San Francisco, state of California, USA
data_temp = data_train[(data_train['country'] == 'US') 
                       & (data_train['state'] == 'CA') 
                       & (data_train['city'] == 'San Francisco')]
cutoff = min(len(data_temp['zip'].value_counts()), 50)
order_descending_temp = data_temp.groupby('zip').size().sort_values().index[::-1][: cutoff].tolist()
fig, ax = plt.subplots(1, 1, figsize = (15, cutoff / 5), sharex = True)
sns.countplot(data = data_temp, y = 'zip', order = order_descending_temp, ax = ax)
ax.set_title("Top ZIP codes in San Francisco, California, USA with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

We observe that in the city of `San Francisco`, the ZIP code with the highest number of training observations is `94103`, followed by `94105`, `94107` and `94102`. We now move on to the *categories* of the POIs and focus only on the observations from `San Francisco - 94103`.

## Categories of POIs in San Francisco - 94103, California, USA

In [None]:
# Horizontal countplot of categories of training observations in the city of San Francisco - 94103, state of California, USA
data_temp = data_train[(data_train['country'] == 'US') 
                       & (data_train['state'] == 'CA') 
                       & (data_train['city'] == 'San Francisco') 
                       & (data_train['zip'] == '94103')]
cutoff = min(len(data_temp['categories'].value_counts()), 50)
order_descending = data_temp.groupby('categories').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 5))
sns.countplot(data = data_temp, y = 'categories', order = order_descending)
plt.title("Top categories in San Francisco - 94103, California, USA with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

Four categories have the highest numbers of training observations in the specified ZIP code: `Food Trucks`, `Meeting Rooms`, `Offices` and `Coffee Shops`. Next, we look at the `name` of the `Coffee Shops` in the `San Francisco - 94103, California, USA` region considered above.

## Names of coffee shops in San Francisco - 94103, California, USA

In [None]:
# Horizontal countplot of names of training observations which are coffee shops in the city of San Francisco - 94103, state of California, USA
data_temp = data_train[(data_train['country'] == 'US') 
                       & (data_train['state'] == 'CA') 
                       & (data_train['city'] == 'San Francisco') 
                       & (data_train['zip'] == '94103') 
                       & (data_train['categories'] == 'Coffee Shops')]
cutoff = min(len(data_temp['name'].value_counts()), 75)
order_descending = data_temp.groupby('name').size().sort_values().index[::-1][: cutoff].tolist()
plt.figure(figsize = (15, cutoff / 2))
sns.countplot(data = data_temp, y = 'name', order = order_descending)
plt.title("Names of coffee shops in San Francisco - 94103, California, USA with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

We observe that there are five `Starbucks` coffee shops in the region. We shall check `address`, `phone` and `url` to see if any two of these records refer to the same shop.

## Address, Phone and URL of Starbucks coffee shops in San Francisco - 94103, California, USA

In [None]:
# Horizontal countplots of address, phone and url of training observations which are Starbucks coffee shops in the city of San Francisco - 94103, state of California, USA
data_temp = data_train[(data_train['country'] == 'US') 
                       & (data_train['state'] == 'CA') 
                       & (data_train['city'] == 'San Francisco') 
                       & (data_train['zip'] == '94103') 
                       & (data_train['categories'] == 'Coffee Shops')
                       & (data_train['name'] == 'Starbucks')]
fig, ax = plt.subplots(3, 1, figsize = (15, 9))
sns.countplot(data = data_temp, y = 'address', ax = ax[0])
ax[0].set_title("Addresses of Starbucks coffee shops in San Francisco - 94103, California, USA with most training observations", fontsize = 14)
sns.countplot(data = data_temp, y = 'phone', ax = ax[1])
ax[1].set_title("Phone numbers of Starbucks coffee shops in San Francisco - 94103, California, USA with most training observations", fontsize = 14)
sns.countplot(data = data_temp, y = 'url', ax = ax[2])
ax[2].set_title("URLs of Starbucks coffee shops in San Francisco - 94103, California, USA with most training observations", fontsize = 14)
plt.tight_layout()
plt.show()

Thus all the Starbucks coffee shops in the region have distinct addresses, indicating that there are no matches among them. Furthermore, the available *phone* numbers (one of the five shops has none recorded) are also distinct. Note that the available URLs, though they appear distinct, mostly coincide and differ only through variations in *https*, *http*, *www* etc. One may strip prefixes like *https://*, *http://*, *www.*, *https://www.*, *http://www.* etc. to overcome this issue.
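
The clean-up suggested above can be sketched as follows. This is a minimal illustration and not part of the original analysis: the helper name `url_core` and the exact set of stripped prefixes are assumptions, and real preprocessing may need to handle more variants (and missing values).

```python
import re

# Hypothetical helper: reduce a URL string to its "core" part by dropping the scheme,
# a leading "www." and any trailing forward slashes
def url_core(url):
    core = re.sub(r'^https?://', '', url.strip(), flags = re.IGNORECASE)
    core = re.sub(r'^www\.', '', core, flags = re.IGNORECASE)
    # Lowercasing is a simplification: it also lowercases any path component
    return core.rstrip('/').lower()

# Illustrative inputs (not taken from the data)
examples = ['https://www.starbucks.com/', 'http://starbucks.com', 'www.starbucks.com']
print({url_core(u) for u in examples})
```

All three illustrative variants collapse to the same core string, so a string distance computed on the cleaned URLs would no longer be inflated by these prefixes.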

# 5. Univariate Analysis - Pairs Set

- [Match](#Match)
- [Feature Based on Latitude and Longitude](#Feature-Based-on-Latitude-and-Longitude)
- [Features Based on Levenshtein Distance](#Features-Based-on-Levenshtein-Distance)
- [Features Based on Matching](#Features-Based-on-Matching)
- [Distance between Locations](#Distance-between-Locations)
- [Distance between Names](#Distance-between-Names)
- [Distance between Addresses](#Distance-between-Addresses)
- [Distance between URLs](#Distance-between-URLs)
- [Distance between Phone Numbers](#Distance-between-Phone-Numbers)
- [Matching of Countries](#Matching-of-Countries)
- [Matching of States](#Matching-of-States)
- [Matching of Cities](#Matching-of-Cities)
- [Matching of ZIP Codes](#Matching-of-ZIP-Codes)
- [Matching of Categories](#Matching-of-Categories)

In [None]:
# Barplot and donutplot of a dataframe column
def count(df, col):
    fig = make_subplots(rows = 1, cols = 2, specs = [[{'type': 'xy'}, {'type': 'domain'}]])
    x_val = df[col].value_counts(sort = False).index.tolist()
    y_val = df[col].value_counts(sort = False).tolist()
    fig.add_trace(go.Bar(x = x_val, y = y_val, text = y_val, textposition = 'auto'), row = 1, col = 1)
    fig.add_trace(go.Pie(values = y_val, labels = x_val, hole = 0.5, textinfo = 'label+percent', title = f"{col}"), row = 1, col = 2)
    fig.update_layout(height = 500, width = 800, showlegend = False, xaxis = dict(tickmode = 'linear', tick0 = 0, dtick = 1), title = dict(text = f"Frequency distribution of {col}", x = 0.5, y = 0.95)) 
    fig.show()

## Match

In [None]:
# Barplot and donutplot of the 'match' column
count(data_pairs, 'match')

We observe that $68.9\%$ of the pairs in the pregenerated set of pairs match, while $31.1\%$ of the pairs do not match. Next, we consider a typical observation from the pairs data.

In [None]:
# Typical observation from the pregenerated set of pairs
data_pairs.iloc[0]

It contains a pair of observations. The *id*s of the two observations are, as expected, different. The *name*s differ slightly, as do the *latitude* and *longitude*. *Country* and *category* are identical. Some attributes are missing in one of the observations, while some are missing in both. The target variable here is `match`, a Boolean variable taking the value `True` if the two observations refer to the same POI and `False` otherwise. We observe that the number of features can be greatly reduced if we focus on the information relevant to predicting `match` and discard the rest. We initialize a dataframe to store the relevant information extracted from the attributes in `data_pairs`.

In [None]:
# Dataframe initialization for new features based on the pregenerated set of pairs
data_pairs_feat = pd.DataFrame()

## Feature Based on Latitude and Longitude

The information relevant to predicting `match` that is contained in `latitude_1`, `longitude_1`, `latitude_2` and `longitude_2` can be encapsulated in a single variable: the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) `dist_loc` between the two locations (`latitude_1`, `longitude_1`) and (`latitude_2`, `longitude_2`), given by

$$ \text{dist_loc} = \sqrt{\left(\text{latitude_1} - \text{latitude_2}\right)^2 + \left(\text{longitude_1} - \text{longitude_2}\right)^2}. $$
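
Note that this distance is measured in degrees and treats latitude and longitude as planar coordinates, so it is only a rough proxy for physical distance. As a point of comparison, a great-circle distance can be sketched with the haversine formula. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration; the analysis in this notebook uses the Euclidean version.

```python
import numpy as np

# Hypothetical helper: great-circle distance in kilometres via the haversine formula
def haversine_km(lat1, lon1, lat2, lon2, radius_km = 6371.0):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    # Clip guards against rounding slightly outside [0, 1] before the square root
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))

# One degree of longitude spans less distance at higher latitude
print(haversine_km(0.0, 0.0, 0.0, 1.0), haversine_km(60.0, 0.0, 60.0, 1.0))
```

Since the function only uses NumPy operations, it should also accept whole columns such as `data_pairs['latitude_1']` in place of scalars.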

In [None]:
# Distance between locations
data_pairs_feat['dist_loc'] = np.sqrt(((data_pairs['latitude_1'] - data_pairs['latitude_2'])**2) + ((data_pairs['longitude_1'] - data_pairs['longitude_2'])**2))

## Features Based on Levenshtein Distance

In the typical observation printed at the beginning of the section, the name of the same POI is reported as `Café Stad Oudenaarde` in one record and `Café Oudenaarde` in the other. We use a measure of distance between two strings to capture the extent of this difference between the reported names of a pair of observations. For example, we would expect the distance between `Café Stad Oudenaarde` and `Café Oudenaarde` to be smaller than that between `Turkcell` and `Island Spa Theater`. Here we employ the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), which measures the difference between two sequences. It is named after the Soviet mathematician [Vladimir Levenshtein](https://en.wikipedia.org/wiki/Vladimir_Levenshtein), who considered this distance in 1965. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. Mathematically, the Levenshtein distance between two strings $a$ and $b$ (of length $\left\vert a \right\vert$ and $\left\vert b \right\vert$ respectively) is defined recursively as

$$ \text{lev(a, b)} =
\left\{
    \begin{array}{ll}
        \left\vert a \right\vert, & \mbox{if } \left\vert b \right\vert == 0, \\
        \left\vert b \right\vert, & \mbox{if } \left\vert a \right\vert == 0, \\
        \text{lev(tail(a), tail(b))}, & \mbox{if } a[0] == b[0], \\
        1 + \min \left\{
                     \begin{array}{l}
                         \text{lev(tail(a), b)} \\
                         \text{lev(a, tail(b))} \\
                         \text{lev(tail(a), tail(b))}
                     \end{array}
                 \right., & \mbox{otherwise},
    \end{array}
\right. $$

where the *tail* of some string $x$ is a string of all but the first character of $x$, for example `tail('Levenshtein') == 'evenshtein'`, and $x[n]$ is the $n$th character of the string $x$, counting from $0$, for example `'Levenshtein'[0] == 'L'`. Note that the first element in the minimum, `lev(tail(a), b)`, corresponds to *deletion* (from $a$ to $b$), the second `lev(a, tail(b))` to *insertion* and the third `lev(tail(a), tail(b))` to *replacement*.

In [None]:
# Recursive function to compute Levenshtein distance
def lev(a, b):
    @lru_cache(None)
    def min_dist(s1, s2):
        if len(a) == s1:
            return len(b) - s2
        elif len(b) == s2:
            return len(a) - s1
        elif a[s1] == b[s2]:
            return min_dist(s1 + 1, s2 + 1)
        else:
            return 1 + min(
                min_dist(s1, s2 + 1),
                min_dist(s1 + 1, s2),
                min_dist(s1 + 1, s2 + 1)
            )
    return min_dist(0, 0)
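
The memoized recursion above is faithful to the definition, but its recursion depth grows with the combined length of the two strings. As an alternative, the same distance can be computed bottom-up with two rows of the dynamic programming table. The function name `lev_iter` is an assumption for this sketch; it is not used elsewhere in the notebook.

```python
# Hypothetical iterative variant: bottom-up dynamic programming with two rows
def lev_iter(a, b):
    # prev[j] holds the distance between the first i - 1 characters of a and the first j of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start = 1):
        curr = [i]
        for j, cb in enumerate(b, start = 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

print(lev_iter('kitten', 'sitting'))
```

Only two rows are kept in memory at any time, so the memory use is proportional to the length of `b` rather than to the product of the two lengths.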

In [None]:
# Computation of Levenshtein distance for two examples
lev1 = lev('Café Stad Oudenaarde', 'Café Oudenaarde')
lev2 = lev('Turkcell', 'Island Spa Theater')
print(pd.Series({"Levenshtein distance between 'Café Stad Oudenaarde' and 'Café Oudenaarde'": lev1,
                 "Levenshtein distance between 'Turkcell' and 'Island Spa Theater'": lev2}).to_string())

We observe in the pregenerated set of pairs that `name`, `address`, `url` and `phone` for the same POI often vary across records for a variety of reasons. Some records may contain shortened versions of the names. Addresses may vary in their level of detail. URLs may vary by the presence or omission of strings like *http*, *https*, *www*. Phone numbers may vary due to extensions, brackets, spaces and symbols like $+$ or hyphens. For these reasons, we create distance features based on these attributes using the Levenshtein distance. Note that we have to convert some of the input data to string format before feeding it to the function that computes the distance. As a consequence, a missing value is compared as the literal string `nan`, so two missing values are at distance $0$.
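
One way to reduce purely cosmetic variation before computing distances is to normalize the strings first. The sketch below keeps only the digits of a phone number. The helper name `phone_digits` is hypothetical, the input is assumed to be a string already, and whether to keep or drop country calling codes is left open.

```python
import re

# Hypothetical helper: keep only the digits of a phone number given as a string
def phone_digits(phone):
    return re.sub(r'\D', '', phone)

# Illustrative inputs (not taken from the data)
print(phone_digits('+1 (415) 555-0100'), phone_digits('415.555.0100'))
```

After this step the two illustrative numbers differ only by the leading calling code, which a Levenshtein distance would count as a single edit.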

In [None]:
# New features based on Levenshtein distance
dist_name_list = []
dist_address_list = []
dist_url_list = []
dist_phone_list = []

for i in range(len(data_pairs)):
    dist_name_list.append(lev(data_pairs['name_1'][i], data_pairs['name_2'][i]))
    dist_address_list.append(lev(str(data_pairs['address_1'][i]), str(data_pairs['address_2'][i])))
    dist_url_list.append(lev(str(data_pairs['url_1'][i]), str(data_pairs['url_2'][i])))
    dist_phone_list.append(lev(str(data_pairs['phone_1'][i]), str(data_pairs['phone_2'][i])))

data_pairs_feat['dist_name'] = dist_name_list
data_pairs_feat['dist_address'] = dist_address_list
data_pairs_feat['dist_url'] = dist_url_list
data_pairs_feat['dist_phone'] = dist_phone_list

## Features Based on Matching

As there is little scope for confusion about the country of a POI, the two features `country_1` and `country_2` can be replaced by a binary variable `match_country`, which is `True` when `country_1 == country_2` and `False` otherwise.

$$ \text{match_country} =
\left\{
    \begin{array}{ll}
        \text{True,}  & \mbox{if country_1 == country_2,}\\
        \text{False,} & \mbox{otherwise.}
    \end{array}
\right. $$

If `match_country == False`, it is a strong indication of `match == False`. Similarly, we construct `match_city`, `match_state`, `match_zip` and `match_categories`, based respectively on the columns `city`, `state`, `zip` and `categories`. A feature is left missing whenever either of the two compared values is missing. We keep the `match` column as it is.
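
The rule above can also be sketched in vectorized form. In this sketch a comparison is left missing (`NaN`) when either value is missing; the helper name `match_feature` and the toy dataframe are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: True/False where both values are present, NaN otherwise
def match_feature(s1, s2):
    out = (s1 == s2).astype(object)
    out[s1.isna() | s2.isna()] = np.nan
    return out

# Toy data (made up for illustration, not taken from the pairs set)
toy = pd.DataFrame({'country_1': ['US', 'BE', None, 'TR'],
                    'country_2': ['US', 'NL', 'US', None]})
print(match_feature(toy['country_1'], toy['country_2']).tolist())
```

Comparing whole columns at once avoids a Python-level loop over the rows, which may matter for a large set of pairs.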

In [None]:
# New features based on matching
def condition(df, col1, col2, i):
    return (str(df[col1][i]) == 'nan' or str(df[col2][i]) == 'nan')
def value(df, col1, col2, i):
    return (df[col1][i] == df[col2][i])
def match_list(df, col1, col2):
    return [np.nan if condition(df, col1, col2, i) else value(df, col1, col2, i) for i in range(len(df))]

data_pairs_feat['match_country'] = match_list(data_pairs, 'country_1', 'country_2')
data_pairs_feat['match_city'] = match_list(data_pairs, 'city_1', 'city_2')
data_pairs_feat['match_state'] = match_list(data_pairs, 'state_1', 'state_2')
data_pairs_feat['match_zip'] = match_list(data_pairs, 'zip_1', 'zip_2')
data_pairs_feat['match_categories'] = match_list(data_pairs, 'categories_1', 'categories_2')
data_pairs_feat['match'] = data_pairs['match']

In [None]:
# Dataframe of new features and match, based on the pregenerated set of pairs
print(pd.Series({"Memory usage": "{:.2f} MB".format(data_pairs_feat.memory_usage().sum()/(1024*1024)),
                 "Dataframe shape": "{}".format(data_pairs_feat.shape)}).to_string())
print(" ")
data_pairs_feat.head()

## Distance between Locations

This variable is extremely skewed. To deal with it, we apply the transformation $x \mapsto \log(x+\epsilon),$ where $\epsilon$ is a very small positive real number; here we take $\epsilon = 0.00000001$. The shift is needed because the log function maps $0$ to $-\infty$. It keeps the transformed data finite, and keeping $\epsilon$ small ensures that the data points which were originally $0$ stand out from the rest after the transformation. The distributions of both the original and the transformed feature are shown.
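
The effect of the shift can be checked on a few toy values. This is only an illustration with made-up distances; `epsilon` matches the value quoted above.

```python
import numpy as np

epsilon = 0.00000001

# Toy distances (made up for illustration), including an exact zero
dist = np.array([0.0, 0.0001, 0.01, 1.0])
transformed = np.log(dist + epsilon)

# The zero is mapped to log(epsilon), roughly -18.42, well below the other values
print(transformed)
```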

In [None]:
# Histogram of distance between locations
fig, ax = plt.subplots(1, 2, figsize = (15, 6), sharex = False, sharey = False)
data_temp = data_pairs_feat.copy(deep = True)
epsilon = 0.00000001
data_temp['dist_loc_transformed'] = np.log(data_temp['dist_loc'] + epsilon)
sns.histplot(data = data_temp, x = 'dist_loc', bins = 30, hue = 'match', ax = ax[0])
sns.histplot(data = data_temp, x = 'dist_loc_transformed', bins = 30, hue = 'match', ax = ax[1])
plt.tight_layout()
plt.show()

## Distance between Names

In [None]:
# Histogram of distance between names
plt.figure(figsize = (9, 6))
sns.histplot(data = data_pairs_feat, x = 'dist_name', bins = 30, hue = 'match')
plt.tight_layout()
plt.show()

## Distance between Addresses

In [None]:
# Histogram of distance between addresses
plt.figure(figsize = (9, 6))
sns.histplot(data = data_pairs_feat, x = 'dist_address', bins = 30, hue = 'match')
plt.tight_layout()
plt.show()

Apart from the global peak at $0,$ we observe a noticeable local peak between $10$ and $15$. There may be more than one reason behind it. The addresses may be short, so that the Levenshtein distance between the two addresses in a pair is small. The two observations in a pair may also report the same address at different levels of detail, which likely contributes to a substantial number of pairs with Levenshtein distance between addresses greater than $0$ but less than $15$.

## Distance between URLs

In [None]:
# Histogram of distance between URLs
plt.figure(figsize = (9, 6))
sns.histplot(data = data_pairs_feat, x = 'dist_url', bins = 30, hue = 'match')
plt.tight_layout()
plt.show()

## Distance between Phone Numbers

In [None]:
# Histogram of distance between phone numbers
plt.figure(figsize = (9, 6))
sns.histplot(data = data_pairs_feat, x = 'dist_phone', bins = 30, hue = 'match')
plt.tight_layout()
plt.show()

There is a local peak around $10,$ which is expected, as most phone numbers have $10$ digits. Two phone numbers of $10$ digits each that have no digits in common are at Levenshtein distance $10$ (as it requires $10$ substitutions to go from one number to the other). Of course, the distance can be larger because country calling codes can be involved.

## Matching of Countries

In [None]:
# Frequency distribution of a feature, split by a Boolean target
"""
Input: Dataframe <df>, categorical column <col>, Boolean target variable <boolean_target>
Output (left): Donutplot of df[col] when boolean_target == True
Output (right): Donutplot of df[col] when boolean_target == False
"""
def donut(df, col, boolean_target):
    fig = make_subplots(rows = 1, cols = 2, specs = [[{'type': 'domain'}, {'type': 'domain'}]])
    x_val_true = df[df[boolean_target] == True][col].value_counts(sort = False).index.tolist()
    y_val_true = df[df[boolean_target] == True][col].value_counts(sort = False).tolist()
    fig.add_trace(go.Pie(values = y_val_true, labels = x_val_true, hole = 0.5, textinfo = 'label+percent', title = f"{boolean_target} = True"), row = 1, col = 1)
    x_val_false = df[df[boolean_target] == False][col].value_counts(sort = False).index.tolist()
    y_val_false = df[df[boolean_target] == False][col].value_counts(sort = False).tolist()
    fig.add_trace(go.Pie(values = y_val_false, labels = x_val_false, hole = 0.5, textinfo = 'label+percent', title = f"{boolean_target} = False"), row = 1, col = 2)
    fig.update_layout(height = 500, width = 800, showlegend = False, xaxis = dict(tickmode = 'linear', tick0 = 0, dtick = 1), title = dict(text = f"Frequency distribution of {col} by {boolean_target}", x = 0.5, y = 0.95)) 
    fig.show()

In [None]:
# Barplot and donutplot of 'match_country'
count(data_pairs_feat, 'match_country')

In [None]:
# Donutplots of 'match_country' for different target classes
donut(data_pairs_feat, 'match_country', 'match')

Curiously, the pregenerated set of pairs predominantly ($99.7\%$) contains pairs of observations from the same country. One reason perhaps is that if two observations are reported to be from different countries, then they almost certainly refer to different POIs (unless something is horrendously wrong). So those pairs can be classified as `match == False` with reasonable confidence from common sense itself, without requiring any classification algorithm. There may still be errors due to faulty records, as seen in the $0.4\%$ of observations in the `match == True` class for which `match_country == False`. Nonetheless, for modeling purposes, it may be useful to focus mostly on the pairs whose observations are from the same country.

## Matching of States

In [None]:
# Barplot and donutplot of 'match_state'
count(data_pairs_feat, 'match_state')

In [None]:
# Donutplots of 'match_state' for different target classes
donut(data_pairs_feat, 'match_state', 'match')

## Matching of Cities

In [None]:
# Barplot and donutplot of 'match_city'
count(data_pairs_feat, 'match_city')

In [None]:
# Donutplots of 'match_city' for different target classes
donut(data_pairs_feat, 'match_city', 'match')

## Matching of ZIP Codes

In [None]:
# Barplot and donutplot of 'match_zip'
count(data_pairs_feat, 'match_zip')

In [None]:
# Donutplots of 'match_zip' for different target classes
donut(data_pairs_feat, 'match_zip', 'match')

## Matching of Categories

In [None]:
# Barplot and donutplot of 'match_categories'
count(data_pairs_feat, 'match_categories')

In [None]:
# Donutplots of 'match_categories' for different target classes
donut(data_pairs_feat, 'match_categories', 'match')

Unlike the other matching attributes, `match_categories` is *false* for the majority of the pairs in the pregenerated set, i.e. for most pairs the categories of the two observations do not match.

# 6. Multivariate Analysis - Pairs Set

- [Correlation structure of numerical features](#Correlation-structure-of-numerical-features)
- [Bivariate scatterplots of numerical features](#Bivariate-scatterplots-of-numerical-features)
- [Trivariate scatterplots of numerical features](#Trivariate-scatterplots-of-numerical-features)
- [Contingency tables of categorical features](#Contingency-tables-of-categorical-features)
- [Numerical features for different classes of categorical features](#Numerical-features-for-different-classes-of-categorical-features)

## Correlation structure of numerical features

The [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is a statistical measure of linear dependence between two variables. A correlation close to $\pm 1$ indicates that the two variables are nearly linearly related; however, this does not prove any causal relationship between them. The measure is defined as the covariance of the two variables, scaled by the product of their standard deviations. Let $\left\{\left(x_1, y_1\right), \left(x_2, y_2\right), \cdots, \left(x_n, y_n\right)\right\}$ be paired data on the variables $\left(x, y\right)$. Then the correlation coefficient of the two variables is given by
$$ r_{xy} := \frac{\text{cov}\left(x, y\right)}{s_x s_y} = \frac{\frac{1}{n}\sum_{i=1}^n\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\frac{1}{n}\sum_{i=1}^n\left(x_i - \bar{x}\right)^2} \sqrt{\frac{1}{n}\sum_{i=1}^n\left(y_i - \bar{y}\right)^2}},$$

where $\bar{x}$ and $\bar{y}$ denote the respective sample means of the two variables, given by $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$.
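
As a sanity check of the formula, the coefficient can be computed directly from the definition and compared with `np.corrcoef`. The function name `corr_coef` and the toy data are assumptions made for illustration.

```python
import numpy as np

# Correlation coefficient computed directly from the definition
def corr_coef(x, y):
    x, y = np.asarray(x, dtype = float), np.asarray(y, dtype = float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())  # np.std uses the 1/n convention by default

# Toy data (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(corr_coef(x, y), np.corrcoef(x, y)[0, 1])
```

The $\frac{1}{n}$ factors cancel between numerator and denominator, so the result does not depend on whether the $\frac{1}{n}$ or the $\frac{1}{n-1}$ convention is used, as long as it is used consistently.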

In [None]:
# Correlation coefficients of pairs of numerical features
df_corr = pd.DataFrame(columns = ['feature_1', 'feature_2', 'all pairs', 'matched pairs', 'unmatched pairs'])
cols_num = data_pairs_feat.columns[(data_pairs_feat.dtypes == 'int64') | (data_pairs_feat.dtypes == 'float64')].tolist()
data_true = data_pairs_feat[data_pairs_feat['match'] == True]
data_false = data_pairs_feat[data_pairs_feat['match'] == False]
for i in range(len(cols_num)):
    for j in range(len(cols_num)):
        if i < j:
            df_corr.loc[len(df_corr.index)] = [cols_num[i], cols_num[j], data_pairs_feat[cols_num[i]].corr(data_pairs_feat[cols_num[j]]), data_true[cols_num[i]].corr(data_true[cols_num[j]]), data_false[cols_num[i]].corr(data_false[cols_num[j]])]
df_corr.sort_values(by = 'all pairs', ascending = False, inplace = True)
df_corr

The correlation structure of the numerical features is more or less similar for pairs with `match == True` and pairs with `match == False`.

In [None]:
# Correlation heatmap of numerical features
plt.figure(figsize = (10, 7.5))
sns.heatmap(data_pairs_feat[cols_num].corr(), vmin = -1, vmax = 1, annot = True, cmap = plt.cm.CMRmap_r)
plt.tight_layout()
plt.show()

- `dist_loc` is approximately uncorrelated with each of the other numerical features
- `dist_address` has slight positive correlation with `dist_name` and `dist_phone`
- `dist_url` and `dist_phone` have moderate positive correlation

## Bivariate scatterplots of numerical features

In [None]:
# Bivariate scatterplots
pairs = [(cols_num[i], cols_num[j]) for i in range(len(cols_num)) for j in range(len(cols_num)) if i < j]
for z in pairs:
    fig, ax = plt.subplots(1, 2, figsize = (15, 6), sharex = True, sharey = True)
    sns.scatterplot(data = data_pairs_feat[data_pairs_feat['match'] == True], x = z[0], y = z[1], ax = ax[0])
    ax[0].set_title("match = True", fontsize = 14)
    sns.scatterplot(data = data_pairs_feat[data_pairs_feat['match'] == False], x = z[0], y = z[1], ax = ax[1])
    ax[1].set_title("match = False", fontsize = 14)
    plt.tight_layout()
plt.show()

## Trivariate scatterplots of numerical features

In [None]:
# Trivariate scatterplots
triples = [(cols_num[i], cols_num[j], cols_num[k]) for i in range(len(cols_num)) for j in range(len(cols_num)) for k in range(len(cols_num)) if i < j < k]
for z in triples:
    fig = plt.figure(figsize = (15, 9))
    ax = fig.add_subplot(1, 2, 1, projection = '3d')
    x_true = data_pairs_feat[data_pairs_feat['match'] == True][z[0]]
    y_true = data_pairs_feat[data_pairs_feat['match'] == True][z[1]]
    z_true = data_pairs_feat[data_pairs_feat['match'] == True][z[2]]
    s1 = ax.scatter(x_true, y_true, z_true, s = 40, marker = 'o', c = y_true, alpha = 1)
    ax.set_title("match = True", fontsize = 14)
    ax.set_xlabel(z[0])
    ax.set_ylabel(z[1])
    ax.set_zlabel(z[2])
    ax = fig.add_subplot(1, 2, 2, projection = '3d')
    x_false = data_pairs_feat[data_pairs_feat['match'] == False][z[0]]
    y_false = data_pairs_feat[data_pairs_feat['match'] == False][z[1]]
    z_false = data_pairs_feat[data_pairs_feat['match'] == False][z[2]]
    s2 = ax.scatter(x_false, y_false, z_false, s = 40, marker = 'o', c = y_false, alpha = 1)
    ax.set_title("match = False", fontsize = 14)
    ax.set_xlabel(z[0])
    ax.set_ylabel(z[1])
    ax.set_zlabel(z[2])
    plt.tight_layout()
plt.show()

## Contingency tables of categorical features

In [None]:
# Function to compute contingency tables for pairs of binary features
def contingency_pairs(df, pairs, ncols = 3, figsize_multiplier = 4, update_ylabel = False):
    nrows = math.ceil(len(pairs) / ncols)
    figsize = (figsize_multiplier * ncols, 0.8 * figsize_multiplier * nrows)
    fig, ax = plt.subplots(nrows, ncols, figsize = figsize, sharey = False)
    labels = [True, False]
    for i in range(len(pairs)):
        contingency_mat = np.zeros(shape = (2, 2))
        for j in range(2):
            for k in range(2):
                contingency_mat[j][k] = len([l for l in range(len(df)) if df[pairs[i][0]][l] == labels[j] and df[pairs[i][1]][l] == labels[k]])
        contingency_df = pd.DataFrame(contingency_mat)
        hm = sns.heatmap(contingency_df, annot = True, annot_kws = {"size": 16}, fmt = 'g', ax = ax[i // ncols, i % ncols])
        hm.set_xlabel(f'{pairs[i][1]}', fontsize = 14)
        hm.set_ylabel(f'{pairs[i][0]}', fontsize = 14)
        hm.set_xticklabels(labels, fontdict = {'fontsize': 12}, rotation = 0, ha = "right")
        hm.set_yticklabels(labels, fontdict = {'fontsize': 12}, rotation = 0, ha = "right")
        if i % ncols != 0 and update_ylabel == True:
                ax[i // ncols, i % ncols].set_ylabel(" ")
    plt.tight_layout()
    plt.show()

In [None]:
# Contingency tables for pairs of binary features
cols_cat = [col for col in data_pairs_feat.columns if data_pairs_feat[col].nunique() == 2 and col != 'match']
pairs = [(cols_cat[i], cols_cat[j]) for i in range(len(cols_cat)) for j in range(len(cols_cat)) if i < j]
contingency_pairs(data_pairs_feat, pairs, ncols = 2, figsize_multiplier = 6)
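
For reference, the counts in a single contingency table can also be obtained without explicit loops using `pd.crosstab`. The toy dataframe below is made up for illustration; rows with a missing value are dropped explicitly, which agrees with the equality checks in `contingency_pairs` (a missing value equals neither `True` nor `False`).

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), with a missing value in each column
toy = pd.DataFrame({'match_country': [True, True, False, True, np.nan, True],
                    'match_city': [True, False, False, np.nan, True, True]})

# Drop rows with a missing value explicitly, then cross-tabulate the two binary features
complete = toy.dropna().astype(bool)
table = pd.crosstab(complete['match_country'], complete['match_city'])
print(table)
```

Note that `crosstab` orders the labels as `False`, `True`, whereas the heatmaps above use the order `True`, `False`.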

## Numerical features for different classes of categorical features

In [None]:
# Numerical features for different classes of categorical features
pairs = [(cols_num[i], cols_cat[j]) for i in range(len(cols_num)) for j in range(len(cols_cat))]
ncols = 3
nrows = math.ceil(len(pairs) / ncols)
fig, ax = plt.subplots(nrows, ncols, figsize = (5 * ncols, 4.2 * nrows), sharey = False)
for i in range(len(pairs)):
    sns.violinplot(data = data_pairs_feat, x = pairs[i][1], y = pairs[i][0], ax = ax[i // ncols, i % ncols])
plt.tight_layout()
plt.show()

As expected, every plot shows a global peak at $0$ and a local peak not far from it, which is consistent with the histograms of the numerical features in the univariate analysis. The violinplots for the following features show some finer details:
- `dist_loc` for different classes of `match_country`, `match_state`, `match_city`, `match_zip`: The two features are dependent in general, as `dist_loc` tends to be smaller when `match_country == True` than when `match_country == False`. The relationship is not strict, however: two POIs can lie at opposite ends of the same country, and two POIs can lie close to each other on either side of the border between two neighboring countries. A similar pattern appears in the violinplots of `dist_loc` for different classes of `match_state`, `match_city` and `match_zip`. Interestingly, the pattern is not replicated for `dist_loc` and `match_categories`.
- `dist_name` for different classes of `match_country`, `match_state`, `match_city`, `match_zip`, `match_categories`: Though the effect is less extreme than with `dist_loc`, `dist_name` is more likely to be $0$ when the attributes `match_country`, `match_state`, `match_city`, `match_zip`, `match_categories` are *true* than when they are *false*.
- `dist_url` and `match_country`: The violinplot for `dist_url` is far less concentrated around $0$ when `match_country == False` than when `match_country == True`. However, the same pattern is not seen when `dist_url` is plotted against `match_state`, `match_city` or `match_zip`.
- `dist_phone` and `match_country`: The distributions of `dist_phone` for `match_country == True` and `match_country == False` differ considerably in their concentration about $0$ and in the presence of outliers. By contrast, the distribution of `dist_phone` appears more or less unaffected by the status of `match_state`, `match_city`, `match_zip` and `match_categories`.
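
The visual impressions above can be checked numerically by summarizing a numerical feature within each class of a categorical feature. The following is a minimal sketch only: it runs on a small synthetic dataframe `toy` (an assumption, not the competition data), and the helper `summarize_by_class` is a hypothetical name introduced for illustration. Applied to `data_pairs_feat`, the same helper would report, for each class, the share of pairs with distance exactly $0$ together with the median and the 95th percentile.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_pairs_feat (hypothetical values, not the competition data)
rng = np.random.default_rng(0)
n = 1000
match_country = rng.random(n) < 0.8
# Distances are drawn smaller when the countries match, and larger otherwise
dist_loc = np.where(match_country, rng.exponential(0.05, n), rng.exponential(5.0, n))
# Roughly 30% of the matching pairs get a distance of exactly zero
dist_loc[match_country & (rng.random(n) < 0.3)] = 0.0
toy = pd.DataFrame({"match_country": match_country, "dist_loc": dist_loc})

def summarize_by_class(df, num_col, cat_col):
    """Per-class count, share of exact zeros, median and 95th percentile of num_col."""
    grouped = df.groupby(cat_col)[num_col]
    return pd.DataFrame({
        "count": grouped.size(),
        "share_zero": grouped.apply(lambda s: (s == 0).mean()),
        "median": grouped.median(),
        "q95": grouped.quantile(0.95),
    })

summary = summarize_by_class(toy, "dist_loc", "match_country")
print(summary)
```

A large gap in `share_zero` or `median` between the two classes corresponds to violins of visibly different shape, while near-identical rows correspond to the cases where the categorical feature appears to have no effect.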

# Acknowledgements

- [Foursquare - Location Matching](https://www.kaggle.com/competitions/foursquare-location-matching) competition

# References

- [Correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
- [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)
- [Foursquare City Guide](https://en.wikipedia.org/wiki/Foursquare_City_Guide)
- [Foursquare Labs Inc.](https://foursquare.com/)
- [Foursquare Swarm](https://en.wikipedia.org/wiki/Foursquare_Swarm)
- [Grove Karl Gilbert](https://en.wikipedia.org/wiki/Grove_Karl_Gilbert)
- [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2)
- [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index)
- [Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3118-5)
- [Jaccard/Tanimoto similarity test and estimation methods (arxiv version)](https://arxiv.org/abs/1903.11372)
- [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
- [List of U.S. state and territory abbreviations](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations)
- [Olympus Mons](https://en.wikipedia.org/wiki/Olympus_Mons)
- [Paul Jaccard](https://en.wikipedia.org/wiki/Paul_Jaccard)
- [Point of interest](https://en.wikipedia.org/wiki/Point_of_interest)
- [Vladimir Levenshtein](https://en.wikipedia.org/wiki/Vladimir_Levenshtein)

In [None]:
# Runtime and memory usage (resident set size of the current process)
stop = time.time()
print(pd.Series({"Process runtime": "{:.2f} seconds".format(stop - start),
                 "Process memory usage": "{:.2f} MB".format(process.memory_info().rss / (1024 * 1024))}).to_string())