### Shopee Price Match Guarantee: Create Training Data

This notebook demonstrates how to generate triplets of **anchor**, **positive** and **negative** posting ids, which are used to train a **Siamese Network** for **image-based product matching**.

The triplets are built from the **label_group** column: the positive is sampled from the anchor's own label group and the negative from a different label group. Additionally, the posting id triplets are used to create **image and title triplets**, which can help improve the predictions later on. All results of this notebook are saved to CSV files for later use.
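
As a minimal sketch of the sampling idea, the snippet below draws one triplet from a toy DataFrame. The toy data, the helper name `sample_triplet` and the use of a local `random.Random` instance are assumptions made for illustration. Unlike the function defined later in this notebook, the sketch also picks the negative at random inside the chosen group.

```python
import random
import pandas as pd

# Toy stand-in for train.csv: only the two columns the sampling needs.
toy_df = pd.DataFrame({
    "posting_id": ["a1", "a2", "a3", "b1", "b2", "c1", "c2"],
    "label_group": [1, 1, 1, 2, 2, 3, 3],
})

def sample_triplet(df, anchor_id, rng):
    """Return (anchor, positive, negative) posting ids for one anchor (hypothetical helper)."""
    # Map every label group to the list of posting ids it contains.
    groups = df.groupby("label_group")["posting_id"].apply(list).to_dict()
    anchor_label = df.loc[df["posting_id"] == anchor_id, "label_group"].iloc[0]
    # Positive: any other posting in the anchor's label group.
    positives = [p for p in groups[anchor_label] if p != anchor_id]
    # Negative: any posting from a randomly chosen different label group.
    negative_label = rng.choice([g for g in groups if g != anchor_label])
    return anchor_id, rng.choice(positives), rng.choice(groups[negative_label])

rng = random.Random(123)
triplet = sample_triplet(toy_df, "a1", rng)
print(triplet)
```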

### Reference:
The data preparation approach is inspired by the notebook Shopee - Generate data for triplet loss: https://www.kaggle.com/xhlulu/shopee-generate-data-for-triplet-loss/comments

In [None]:
#Import required libraries
import pandas as pd
import random

In [None]:
#Load the training data
train_df = pd.read_csv('../input/shopee-product-matching/train.csv')

#View sample train data
train_df.head()

In [None]:
#Create a dictionary of label group and corresponding products
label_group_dict = dict(list(train_df.groupby('label_group')))

#View sample records in label group dictionary
dict_items = label_group_dict.items()
list(dict_items)[:2]

In [None]:
#Define a custom function to create training triplets for siamese network training
def create_train_triplets(df):
    
    #Set a random seed value
    random.seed(123)
    
    #Create a dictionary of label group and corresponding products
    label_group_dict = dict(list(df.groupby('label_group')))

    #Create a list of all label groups
    label_groups = list(label_group_dict.keys())
    
    #Create an empty dataframe to store triplet records
    triplet_df = pd.DataFrame(columns = ['anchor', 'positive', 'negative'])
    
    #Loop through all label groups to create anchor, positive and negative columns  
    for current_label in label_groups:
        
        #Create a list of all posting ids in current label group
        current_label_posting_ids = label_group_dict[current_label].posting_id.tolist()
        
        #Create triplets per posting id in current label group
        for current_posting_id in current_label_posting_ids:
            
            #Set the anchor
            anchor_id = current_posting_id
            
            ##---------------------------------------------------------------##
            ##---- We will create the positive data from same label group ---##
            ##---------------------------------------------------------------##
            
            #Create a list of all posting ids excluding the anchor id 
            other_positive_ids = [n for n in current_label_posting_ids if n != current_posting_id]
            
            #Skip anchors that have no other posting in their label group
            if not other_positive_ids:
                continue
            
            #Set the positive image randomly from other positive ids of current label group
            positive_id = random.choice(other_positive_ids)
            
            
            ##---------------------------------------------------------------##
            ##--- We will create the negative data from other label groups --##
            ##---------------------------------------------------------------##
            
            #Create a list of all other label groups than current label group for negative id
            other_label_groups = [n for n in label_groups if n != current_label]
            
            #Pick a random other label group and use its first posting id as the negative
            negative_id = label_group_dict[random.choice(other_label_groups)].posting_id.tolist()[0]
            
            
            ##---------------------------------------------------------------##
            ##--- We will update the triplet dataframe with latest record ---##
            ##---------------------------------------------------------------##
            #Add the triplet as a new row (DataFrame.append was removed in pandas 2.0)
            triplet_df.loc[len(triplet_df)] = [anchor_id, positive_id, negative_id]
            
            
    
    return triplet_df

In [None]:
#Create training data posting ids triplet dataframe
train_ids_triplets_df = create_train_triplets(train_df)

#View sample records in train_ids_triplets_df
train_ids_triplets_df.head()

We have now created the posting id triplets needed to train the Siamese Network. These id triplets can be used to build equivalent dataframes for the title and image information, which can help in deciding the final matches.
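
Before building on the id triplets, it can be worth checking that each positive shares the anchor's label group and each negative does not. The following is a minimal sketch on toy data; the helper name `count_invalid_triplets` and the toy frames are assumptions for illustration, and the check relies on posting ids being unique.

```python
import pandas as pd

def count_invalid_triplets(triplets_df, source_df):
    """Count triplets that break the anchor/positive/negative rules (hypothetical helper)."""
    # Lookup from posting id to label group; assumes posting ids are unique.
    label_of = source_df.set_index("posting_id")["label_group"]
    anchor_lbl = triplets_df["anchor"].map(label_of)
    pos_lbl = triplets_df["positive"].map(label_of)
    neg_lbl = triplets_df["negative"].map(label_of)
    valid = (
        (anchor_lbl == pos_lbl)                                # same group as anchor
        & (anchor_lbl != neg_lbl)                              # different group from anchor
        & (triplets_df["anchor"] != triplets_df["positive"])   # positive is not the anchor itself
    )
    return int((~valid).sum())

# Toy data standing in for train_df and train_ids_triplets_df.
source_df = pd.DataFrame({
    "posting_id": ["a1", "a2", "b1", "b2"],
    "label_group": [1, 1, 2, 2],
})
good_triplets = pd.DataFrame({
    "anchor": ["a1", "b1"], "positive": ["a2", "b2"], "negative": ["b1", "a1"],
})
bad_triplets = pd.DataFrame({
    "anchor": ["a1"], "positive": ["b1"], "negative": ["a2"],
})

print(count_invalid_triplets(good_triplets, source_df))
print(count_invalid_triplets(bad_triplets, source_df))
```

On the real data, the same call would be `count_invalid_triplets(train_ids_triplets_df, train_df)`, and a non-zero result would point to a sampling problem.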

In [None]:
#Create a dictionary mapping each posting id to its product image
image_dictionary = train_df.set_index('posting_id').image.to_dict()

#View sample records in image dictionary
dict_items = image_dictionary.items()
list(dict_items)[:5]

In [None]:
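#Create images triplets dataframe using image_dictionary
#(column-wise Series.map is used because DataFrame.applymap is deprecated in recent pandas versions)
train_images_triplets_df = train_ids_triplets_df.apply(lambda col: col.map(image_dictionary))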

#View sample records in train_images_triplets_df
train_images_triplets_df.head()

Each posting id in the training triplets is now mapped to its value from the image column.
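
If full file paths are needed later instead of the raw values of the image column, they could be built as sketched below. This assumes that the image column holds file names and that the images sit in a `train_images` folder next to `train.csv`; neither is stated in this notebook, so `IMAGE_DIR` and the toy file names are hypothetical.

```python
import os
import pandas as pd

# Assumed location of the training images; adjust to your environment.
IMAGE_DIR = "../input/shopee-product-matching/train_images"

# Toy image triplets standing in for train_images_triplets_df.
toy_images_triplets_df = pd.DataFrame({
    "anchor": ["img_a1.jpg"],
    "positive": ["img_a2.jpg"],
    "negative": ["img_b1.jpg"],
})

# Prefix every file name with the image directory, column by column.
toy_paths_df = toy_images_triplets_df.apply(
    lambda col: col.map(lambda name: os.path.join(IMAGE_DIR, name))
)
print(toy_paths_df.iloc[0].tolist())
```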

In [None]:
#Create a dictionary mapping each posting id to its product title
title_dictionary = train_df.set_index('posting_id').title.to_dict()

#View sample records in title dictionary
dict_items = title_dictionary.items()
list(dict_items)[:5]

In [None]:
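#Create titles triplets dataframe using title_dictionary
#(column-wise Series.map is used because DataFrame.applymap is deprecated in recent pandas versions)
train_titles_triplets_df = train_ids_triplets_df.apply(lambda col: col.map(title_dictionary))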

#View sample records in train_titles_triplets_df
train_titles_triplets_df.head()

Each posting id in the training triplets is now mapped to its product title as well.

We will now save all three dataframes so they can be used for training the model and improving the final matches.

In [None]:
#Save the information to csv files
train_ids_triplets_df.to_csv('train_ids_triplets.csv', index=False)
train_images_triplets_df.to_csv('train_images_triplets.csv', index=False)
train_titles_triplets_df.to_csv('train_titles_triplets_df.csv', index=False)
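
A later notebook can read the saved files back with `pd.read_csv`. The sketch below shows the round trip on a toy triplets dataframe written to a temporary directory; the toy ids and the `dtype=str` choice (to keep posting ids as strings) are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy triplets standing in for train_ids_triplets_df.
toy_triplets = pd.DataFrame({
    "anchor": ["a1", "b1"],
    "positive": ["a2", "b2"],
    "negative": ["b1", "a1"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_ids_triplets.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy_triplets.to_csv(path, index=False)
    # Read every column as a string so ids are not reinterpreted as numbers.
    reloaded = pd.read_csv(path, dtype=str)

print(reloaded.head())
```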