## Generate triplet training data

_[Shopee - Price Match Guarantee](https://www.kaggle.com/c/shopee-product-matching)_

This notebook shows you how to generate triplets of training data, using the `label_group` column for positive sampling and group-level sampling for negatives. The output is three CSV files, each with `anchor`, `positive`, and `negative` columns, where the values are product IDs, image names, or product titles respectively.

In [1]:
import random

import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()

This helper function generates an anchor, a positive sample from the same label group, and a negative sample from a different label group. The outer function wraps a given dataframe and returns an inner function, which should be applied to each row of that `df`.

In [1]:
def generate_triplets(df):
    # Source: https://www.kaggle.com/xhlulu/shopee-generate-data-for-triplet-loss
    random.seed(42)
    group2df = dict(list(df.groupby('label_group')))
    
    def aux(row):
        anchor = row.posting_id
        
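        # Sample a positive data point from the same group, excluding
        # the anchor itself. This assumes every label group has at least
        # two postings; random.choice raises IndexError on an empty list.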
        ids = group2df[row.label_group].posting_id.tolist()
        ids.remove(row.posting_id)
        positive = random.choice(ids)
        
        # Sample a group from all groups except the anchor's, then sample
        # a product from that group
        groups = list(group2df.keys())
        groups.remove(row.label_group)
        neg_group = random.choice(groups)
        negative = random.choice(group2df[neg_group].posting_id.tolist())

        return anchor, positive, negative
    
    return aux
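The helper above rebuilds the list of groups for every row and reseeds the global `random` module. As a rough sketch only, the variant below precomputes the ID lists per group once and uses a local random generator. The names `make_triplet_sampler`, `seed`, and the toy dataframe are hypothetical and not part of the original notebook. Unlike the original, it returns `None` for an anchor whose group has no other posting instead of raising an error.

```python
import random

import pandas as pd


def make_triplet_sampler(df, seed=42):
    """Hypothetical variant of generate_triplets with precomputed ID lists."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    group_to_ids = df.groupby('label_group')['posting_id'].agg(list).to_dict()
    groups = list(group_to_ids)
    if len(groups) < 2:
        raise ValueError('need at least two label groups to sample a negative')

    def sample(row):
        # Positive: any other posting in the anchor's group.
        candidates = [i for i in group_to_ids[row.label_group] if i != row.posting_id]
        if not candidates:
            return None  # singleton group, no positive available
        positive = rng.choice(candidates)

        # Negative: redraw until the group differs from the anchor's group.
        neg_group = rng.choice(groups)
        while neg_group == row.label_group:
            neg_group = rng.choice(groups)
        negative = rng.choice(group_to_ids[neg_group])
        return row.posting_id, positive, negative

    return sample


# Toy data (made up for illustration): group 3 has a single posting.
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1', 'b2', 'c1'],
    'label_group': [1, 1, 2, 2, 3],
})
sampler = make_triplet_sampler(toy)
toy_triplets = [
    triplet
    for triplet in (sampler(row) for row in toy.itertuples(index=False))
    if triplet is not None
]
```

Redrawing the negative group is cheap when there are many groups, because a collision with the anchor's group is rare. With only a handful of groups, removing the anchor's group from the list, as the original helper does, may be the simpler choice.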

Load the training data and create some useful dictionaries for later:

In [1]:
train = pd.read_csv('../input/shopee-product-matching/train.csv')

# Useful dictionaries; use below to convert if needed
id_to_img = train.set_index('posting_id').image.to_dict()
id_to_title = train.set_index('posting_id').title.to_dict()
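Before sampling, it can be worth checking two assumptions the code relies on: that `posting_id` is unique (otherwise the dictionaries above silently keep only one entry per ID) and that every label group has at least two postings (otherwise the helper has no positive to pick). The sketch below runs both checks on a small made-up dataframe; `toy`, `group_sizes`, and `singleton_groups` are illustrative names, and the same lines should apply to `train` if its columns match.

```python
import pandas as pd

# Toy stand-in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1', 'b2', 'c1'],
    'label_group': [1, 1, 2, 2, 3],
})

# The ID -> value dictionaries are only lossless if IDs are unique.
ids_are_unique = toy['posting_id'].is_unique

# Groups with fewer than two postings cannot provide a positive sample.
group_sizes = toy['label_group'].value_counts()
singleton_groups = group_sizes[group_sizes < 2].index.tolist()

print('unique IDs:', ids_are_unique)
print('groups without a positive:', singleton_groups)
```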

Here, we use the `generate_triplets` helper function defined above and create a new dataframe from it:

In [1]:
train_triplets = train.progress_apply(generate_triplets(train), axis=1).tolist()
train_triplets_df = pd.DataFrame(train_triplets, columns=['anchor', 'positive', 'negative'])
train_triplets_df.head()
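A quick sanity check on the result can catch sampling mistakes early. The sketch below counts triplets that break the intended constraints: the positive must differ from the anchor and share its label group, and the negative must come from another group. The function `count_invalid_triplets` and the toy data are hypothetical; the third toy row is deliberately invalid so the counts are non-zero.

```python
import pandas as pd


def count_invalid_triplets(triplets_df, id_to_group):
    """Count triplets that violate the anchor/positive/negative constraints."""
    anchor_group = triplets_df['anchor'].map(id_to_group)
    positive_group = triplets_df['positive'].map(id_to_group)
    negative_group = triplets_df['negative'].map(id_to_group)
    return {
        'positive_is_anchor': int((triplets_df['anchor'] == triplets_df['positive']).sum()),
        'positive_wrong_group': int((anchor_group != positive_group).sum()),
        'negative_same_group': int((anchor_group == negative_group).sum()),
    }


# Toy data (made up); the last row reuses the anchor as positive and
# takes its negative from the anchor's own group.
toy_groups = {'a1': 1, 'a2': 1, 'b1': 2, 'b2': 2}
toy_triplets_df = pd.DataFrame(
    [('a1', 'a2', 'b1'), ('b2', 'b1', 'a1'), ('a2', 'a2', 'a1')],
    columns=['anchor', 'positive', 'negative'],
)
report = count_invalid_triplets(toy_triplets_df, toy_groups)
print(report)
```

For the real data, the group lookup could be built as `train.set_index('posting_id').label_group.to_dict()`, and all three counts would be expected to be zero.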

From the `train_triplets_df` you can create a triplet dataframe of titles:

In [1]:
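# Map every ID column to titles. Series.map is used per column because
# DataFrame.applymap is deprecated in newer pandas releases.
train_triplets_titles = train_triplets_df.apply(lambda col: col.map(id_to_title))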
train_triplets_titles.head()

The same works for images:

In [1]:
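# Map every ID column to image names. Series.map is used per column because
# DataFrame.applymap is deprecated in newer pandas releases.
train_triplets_imgs = train_triplets_df.apply(lambda col: col.map(id_to_img))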
train_triplets_imgs.head()

Finally, save all three dataframes so you can use the output of this notebook directly. Alternatively, copy the helper function into your own notebook and run it with the code above.

In [1]:
train_triplets_imgs.to_csv('train_triplets_imgs.csv', index=False)
train_triplets_titles.to_csv('train_triplets_titles.csv', index=False)
train_triplets_df.to_csv('train_triplets_ids.csv', index=False)