This is a fork of a notebook by https://www.kaggle.com/sudalairajkumar.

In this notebook, let us see how to create additional matched and non-matched pairs that can be used to augment the `pairs.csv` file.

In [None]:
import numpy as np 
import pandas as pd 

train_df = pd.read_csv("/kaggle/input/foursquare-location-matching/train.csv")
train_df.head()

We are given the `point_of_interest` column. If two rows have the same `point_of_interest` value, then both ids represent the same place, and the pair is a match in our context.
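As a quick illustration of this rule, here is a minimal sketch on a toy frame. The ids, names and `point_of_interest` values are made up for the example and do not come from `train.csv`; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for train.csv; all values here are hypothetical.
toy_df = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4"],
    "name": ["Cafe Uno", "Café Uno", "City Gym", "Book Nook"],
    "point_of_interest": ["P_a", "P_a", "P_b", "P_c"],
})

# keep=False marks every row whose point_of_interest appears more than once,
# i.e. every row that has at least one matching partner.
has_match = toy_df[toy_df["point_of_interest"].duplicated(keep=False)]
print(has_match["id"].tolist())
```

Here only `E_1` and `E_2` share a `point_of_interest`, so they are the only rows that can form a matched pair.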

Let us look at the `pairs.csv` file to get an idea of its structure, so that we can create matched pairs in the same format.

In [None]:
pairs_df = pd.read_csv("/kaggle/input/foursquare-location-matching/pairs.csv")
pairs_df.head()

In [None]:
pairs_df.columns

In [None]:
pairs_df.shape

We see that the `pairs.csv` file has the attributes of the first entry (suffixed `_1`), followed by the attributes of the second entry (suffixed `_2`), and finally a `match` column.

So let us create matched pairs in the same format now.

In [None]:
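# Self-join on point_of_interest: each pair of rows sharing a place becomes one row.
match_df = pd.merge(train_df, train_df, on="point_of_interest", suffixes=('_1', '_2'))
# Drop the rows where an entry was paired with itself.
match_df = match_df[match_df["id_1"]!=match_df["id_2"]]
match_df = match_df.drop(["point_of_interest"], axis=1)
# Every remaining pair shares a point_of_interest, so all of them are matches.
match_df["match"] = True
match_df.head()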

In [None]:
match_df.columns

In [None]:
match_df.shape
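One thing to keep in mind about the self-merge: it emits each matched pair in both orders, (a, b) and (b, a), so a place with n entries contributes n * (n - 1) rows. The sketch below checks this on a toy frame (all values are made up) and shows one possible way to keep a single ordering per pair. Whether both orderings are wanted is a modelling choice, not something the data prescribes.

```python
import pandas as pd

# Toy stand-in for train.csv; all values here are hypothetical.
toy_df = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4", "E_5"],
    "point_of_interest": ["P_a", "P_a", "P_a", "P_b", "P_c"],
})

# Same self-merge as in the cell above, applied to the toy data.
toy_match = pd.merge(toy_df, toy_df, on="point_of_interest", suffixes=("_1", "_2"))
toy_match = toy_match[toy_match["id_1"] != toy_match["id_2"]]

# A place with n entries yields n * (n - 1) ordered pairs.
n = toy_df["point_of_interest"].value_counts()
expected_rows = int((n * (n - 1)).sum())

# Optional: keep one ordering per pair by comparing the id strings.
unordered = toy_match[toy_match["id_1"] < toy_match["id_2"]]
print(len(toy_match), expected_rows, len(unordered))
```

On the toy data the three `P_a` entries give 6 ordered pairs, and 3 pairs once one ordering is dropped.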

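Now let's create non-matched pairs by cross-joining two random samples of rows. A random pair can occasionally share a `point_of_interest`, so the `match` column is computed from the data rather than set to `False`.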

In [None]:
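n_samples = 1000
# A constant key lets the merge below pair every row of sample_1 with every row of sample_2.
sample_1 = train_df.sample(n_samples).assign(key=0)
sample_2 = train_df.sample(n_samples).assign(key=0)
match_sample = (
    sample_1
    .merge(sample_2, on="key", how="outer", suffixes=('_1', '_2'))
    # Drop pairs where the same entry was drawn in both samples.
    .query("id_1 != id_2")
    # A pair is a match only if both entries share a point_of_interest.
    .assign(match=lambda x: x["point_of_interest_1"] == x["point_of_interest_2"])
    .drop(columns=["point_of_interest_1", "point_of_interest_2", "key"])
)

match_sample.shape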

In [None]:
match_sample.match.value_counts()
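As an aside, pandas 1.2 and later also accept `how="cross"` in `merge`, which builds the same Cartesian product without the helper `key` column. The following is a minimal sketch on a toy frame, not a replacement for the cell above; the toy values and the `random_state` seeds are assumptions added so the example is repeatable.

```python
import pandas as pd

# Toy stand-in for train.csv; all values here are hypothetical.
toy_df = pd.DataFrame({
    "id": ["E_1", "E_2", "E_3", "E_4"],
    "point_of_interest": ["P_a", "P_a", "P_b", "P_c"],
})

# Fixed seeds make the two samples repeatable between runs.
sample_1 = toy_df.sample(3, random_state=0)
sample_2 = toy_df.sample(3, random_state=1)

cross = (
    sample_1
    # how="cross" pairs every row of sample_1 with every row of sample_2.
    .merge(sample_2, how="cross", suffixes=("_1", "_2"))
    .query("id_1 != id_2")
    .assign(match=lambda x: x["point_of_interest_1"] == x["point_of_interest_2"])
    .drop(columns=["point_of_interest_1", "point_of_interest_2"])
)
print(cross.shape)
```

The result has the same layout as `match_sample`: the `_1` columns, the `_2` columns, and a computed `match` column.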

We have now created some extra matched and non-matched samples in the same format as the `pairs.csv` dataset.
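To actually use these samples, one option is to concatenate them with the original pairs and drop pairs that appear more than once. The sketch below uses tiny hypothetical frames (`toy_pairs`, `toy_matched`, `toy_random`) in place of `pairs_df`, `match_df` and `match_sample`, and assumes all three share the same column names.

```python
import pandas as pd

# Hypothetical miniature frames; the real ones carry many more columns.
cols = ["id_1", "name_1", "id_2", "name_2", "match"]
toy_pairs = pd.DataFrame([["E_1", "Cafe Uno", "E_2", "Café Uno", True]], columns=cols)
toy_matched = pd.DataFrame(
    [["E_1", "Cafe Uno", "E_2", "Café Uno", True],
     ["E_2", "Café Uno", "E_1", "Cafe Uno", True]],
    columns=cols,
)
toy_random = pd.DataFrame([["E_1", "Cafe Uno", "E_3", "City Gym", False]], columns=cols)

# Align the new frames to the column order of the original pairs before stacking.
pairs_cols = list(toy_pairs.columns)
augmented = (
    pd.concat([toy_pairs, toy_matched[pairs_cols], toy_random[pairs_cols]],
              ignore_index=True)
    # The same ordered (id_1, id_2) pair may already exist in the original pairs.
    .drop_duplicates(subset=["id_1", "id_2"])
    .reset_index(drop=True)
)
print(augmented.shape)
```

Deduplicating on (`id_1`, `id_2`) keeps the first occurrence, so rows from the original pairs take precedence over the generated ones.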

Happy Kaggling!