# How to Bert - Shopee Price Match Guarantee, Part I
In this notebook we will experiment with BERT on the dataset from the Shopee competition. Hopefully I can show you how to set up a proper pipeline for this kind of task, and I hope it will be of use to some of you.

Note that internet access is not allowed for the competition itself, so you'll have to either train locally or make a separate notebook to fine-tune and save your model before importing it into your submission notebook.

This is part I, where we only deal with the title of each item. Part II will also use image embeddings. But before we begin...

# How to really understand Shopee's unsupervised classification challenge
I have read many of your notebooks, and I am grateful to have been introduced to the many models and approaches you are all implementing; for example, I didn't even know about EfficientNet before this challenge. Here I want to summarize my thoughts about the challenge and the reasoning behind my approach.

This is unsupervised learning and because of this we have the problem of having to classify unseen data based on labels we also have not seen. As if this wasn't bad enough, there are two reasons why this challenge in particular is hard.

1. We don't know the number of classes
2. Many of the classes have only two instances

To be specific, the training set has 34250 samples with 11014 labels, and 6979 of those labels occur only twice. We don't know what these numbers are for the hidden test set.

Let's think about this carefully for a moment. The first problem is that while we can specify the number of classes for a clustering algorithm on the training data, this is not going to work in general. Many clustering methods rely on knowing the number of classes beforehand. They define a centroid $\mu_i$ for each class, and each class is then identified with the region of feature space closest to its centroid. The objective is to optimize the $\mu_i$ (along with the model parameters) so that the average distance from each sample to its closest centroid is minimized.
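The label statistics quoted above (number of samples, number of labels, number of labels that occur exactly twice) can be computed along these lines. This is a minimal sketch: `label_group_stats` is a hypothetical helper, and the toy list stands in for the `label_group` column of `train.csv`.

```python
import numpy as np

def label_group_stats(label_groups):
    """Return (n_samples, n_groups, n_groups_of_size_two) for a label column."""
    labels = np.asarray(label_groups)
    # return_counts=True yields one count per distinct label
    _, counts = np.unique(labels, return_counts=True)
    return len(labels), len(counts), int(np.sum(counts == 2))

# Toy stand-in for train['label_group']; the real values come from train.csv
toy_labels = [10, 10, 20, 20, 20, 30, 30]
print(label_group_stats(toy_labels))  # (7, 3, 2)
```

On the real training set you would pass `train['label_group']` instead of the toy list.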

I see a lot of notebooks which just apply k-means clustering in spite of the fact that we don't know k outside of the training data. We might get lucky and estimate a good k, but we essentially don't know it.

For this challenge, in my opinion, we're much better off defining a distance measure $D(z_i,z_j)$ between two embeddings $z_i,z_j$ and then defining joint binary probabilities $q(D(z_i,z_j))$ based on it, i.e. the probability that the two samples are the same product. Instead of thinking in terms of classes, we should really be thinking in terms of distances. The problem we're then faced with is to train our model to return embeddings $z_i$ which are far apart for samples that represent different products and close together when the products are the same.
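As a rough illustration of thinking in distances rather than classes, the sketch below takes cosine similarity as the measure and the identity as $q$ (mirroring the `'COS'` and `'MSE'` settings used later in this notebook), then matches items by thresholding. The helper names and the toy embeddings are assumptions made for the example; no number of classes is needed anywhere.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """Pairwise cosine similarity between the rows of z (shape: n x d)."""
    z = np.asarray(z, dtype=float)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z_unit = z / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return z_unit @ z_unit.T

def match_by_threshold(z, threshold=0.5):
    """For each row i, list every row j whose similarity to i exceeds threshold."""
    sim = cosine_similarity_matrix(z)
    return [np.flatnonzero(row > threshold).tolist() for row in sim]

# Toy embeddings: rows 0 and 1 point in nearly the same direction, row 2 does not
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(match_by_threshold(toy))  # [[0, 1], [0, 1], [2]]
```

Each item always matches itself, which is what the competition's submission format expects as well.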

Training a model to do this is challenging: if we just pick random pairs of embeddings, the vast majority will belong to distinct items. We can assign these the target joint probability $p(z_i,z_j)=0$ and minimize the cross-entropy, but then we will most likely train the model so that the $z_i$'s are simply pushed away from each other in embedding space, regardless of how similar the items actually are. So one of the problems we face is preparing the data properly. Some of this could also be mitigated by picking the right $D(z_i,z_j)$ and $q(x)$.
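One way to avoid a training set made up almost entirely of negative pairs is to fix the fraction of items that get a partner from their own group. The sketch below is a simplified, standard-library illustration of that idea; `sample_pairs` and its parameters are hypothetical names, and it assumes at least two distinct groups exist.

```python
import random

def sample_pairs(label_groups, same_fraction=0.2, seed=0):
    """Assign one partner per item and a hard label (1 = same group, 0 = different).

    Items chosen as positives that have no other member in their group
    fall back to a negative partner.
    """
    rng = random.Random(seed)
    n = len(label_groups)

    # Map each label to the indices of its members
    members = {}
    for idx, label in enumerate(label_groups):
        members.setdefault(label, []).append(idx)
    if len(members) < 2:
        raise ValueError("need at least two distinct label groups")

    # Choose which items get a same-group partner
    positives = set(rng.sample(range(n), int(same_fraction * n)))

    partners, paired = [], []
    for i, label in enumerate(label_groups):
        same = [j for j in members[label] if j != i]
        if i in positives and same:
            partners.append(rng.choice(same))
            paired.append(1)
        else:
            # Rejection-sample until we hit an item from another group
            j = rng.randrange(n)
            while label_groups[j] == label:
                j = rng.randrange(n)
            partners.append(j)
            paired.append(0)
    return partners, paired

toy_labels = ["a", "a", "b", "b", "c", "c"]
partners, paired = sample_pairs(toy_labels, same_fraction=0.5, seed=1)
print(list(zip(partners, paired)))
```

Raising `same_fraction` gives the model more positive pairs to pull together, at the cost of seeing fewer distinct negatives per epoch.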

The crucial thing here is to set up a data-pipeline so we can experiment thoroughly with these three things:
1. How to split training data between sample pairs with hard labels $p(z_i,z_j)=0$ and $p(z_i,z_j)=1$.
2. How $D(\cdot)$ and $q(\cdot)$ should be defined.
3. What models to use to generate $z_i$.
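To experiment with the second point, it can help to evaluate candidate losses on a handful of pairs with known hard labels before training anything. The sketch below assumes cosine similarity as the measure and the mapping $(1+\cos)/2$ as $q$; both are illustrative choices, and all function names are hypothetical.

```python
import numpy as np

def cosine_to_probability(cos_sim, eps=1e-7):
    """Map cosine similarity in [-1, 1] to a probability in (0, 1)."""
    q = (1.0 + np.asarray(cos_sim, dtype=float)) / 2.0
    return np.clip(q, eps, 1.0 - eps)  # keep log() finite in the cross-entropy

def binary_cross_entropy(q, p):
    """Mean cross-entropy between predicted probabilities q and hard labels p."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean(-(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))))

def mean_squared_error(q, p):
    """Mean squared error between predictions q and hard labels p."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((q - p) ** 2))

# Toy pairs: one true match with high similarity, two non-matches
cos_sim = [0.9, -0.2, 0.1]
p = [1, 0, 0]
q = cosine_to_probability(cos_sim)
print("BCE:", binary_cross_entropy(q, p))
print("MSE:", mean_squared_error(q, p))
```

Note that with this $q$, unrelated pairs with cosine similarity near zero still get probability 0.5, so the choice of mapping affects how hard the loss pushes such pairs apart.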

With this in mind, I made the notebook that you are now reading. Hope it helps!

Edit: I realized that this methodology is essentially already described in [this well-known paper](https://arxiv.org/pdf/1908.10084.pdf), which you should definitely read. There are also really good notebooks that already cover some of this: [zzy](https://www.kaggle.com/zzy990106/b0-bert-cv0-9#Use-Text-Embeddings) and [Mr_KnowNothing](https://www.kaggle.com/tanulsingh077/metric-learning-pipeline-only-text-sbert). Please also take a look at [this discussion](https://www.kaggle.com/c/shopee-product-matching/discussion/231510).

In [None]:
from tqdm.notebook import tqdm, trange

import random
import os
import gc
import pickle
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

from transformers import BertTokenizer, BertModel, BertConfig, AdamW

EPOCHS = 25
LR = 5e-5
BATCH_SIZE = 400
VALIDGROUPS = 20

DATAPREP = False
TRAINING = False
SUBMIT = True


ENABLE_CUDA = True
SEED = 147
TOKENMAX = 20
DISTANCE_FUN = 'COS' 
LOSS_FUN = 'MSE' #BCE
MODEL_NAME = 'bert-base-multilingual-cased'


GTSPLIT = 0.2 # Fraction of training samples paired with a partner from the same label_group
THRESHOLD = 0.5 
PATIENCE = 7





DATA = '../input/shopee-product-matching'
TRAINIMAGES = f'{DATA}/train_images'
TESTIMAGES = f'{DATA}/test_images'
TRAINHEADER = f'{DATA}/train.csv'
PREPTRAINHEADER = f'{DATA}/preptrain.csv'
TESTHEADER = f'{DATA}/test.csv'

SAVEFOLDER = '../input/shoppeebertmodel/'
TOKENIZER = '../input/berttokenizer/'
SUBMISSIONFOLDER = './'

if ENABLE_CUDA and torch.cuda.is_available():
    dev = torch.device('cuda')
else:
    dev = torch.device('cpu')

In [None]:

def seed_everything(seed=1001):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
class ScoreTracker:
    def __init__(self, patience=7, mode="max", delta=0.001):
        self.patience = patience
        self.counter = 0
        self.mode = mode
        self.best_score = None
        self.early_stop = False
        self.delta = delta
        if self.mode == "min":
            self.val_score = np.inf
        else:
            self.val_score = -np.inf
        
    def __call__(self, epoch_score, model, model_path):
        
        if self.mode == "min":
            score = -1.0 * epoch_score
        else:
            score = np.copy(epoch_score)
            
        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(epoch_score, model, model_path)
        elif score <= self.best_score:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.save_checkpoint(epoch_score, model, model_path)
            self.counter = 0
        
    def save_checkpoint(self, epoch_score, model, model_path):
        # Only checkpoint on a finite score (NaN never compares equal, so a list test misses it)
        if np.isfinite(epoch_score):
            torch.save(model.state_dict(),model_path)
            
        self.val_score = epoch_score
        
class ShopeeDataset(Dataset):
    def __init__(self, df):
        super().__init__()
        self.df = df
        self.tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
        self.ids = df['posting_id']
        tokens = self.tokenizer(df['title'].to_list(),return_tensors='pt', padding=True, truncation=True, max_length=TOKENMAX)
        self.titles = tokens['input_ids']
        self.title_masks = tokens['attention_mask']
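        # Convert partner indices to an integer tensor so they can index the token tensors
        partner_idx = torch.as_tensor(df['partner_index'].to_numpy(dtype=np.int64))
        self.partners = self.titles[partner_idx]
        self.partner_masks = self.title_masks[partner_idx]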
        self.paired = torch.tensor(df['paired'].to_numpy(),dtype=torch.float)
        
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        return {
            'ids': self.ids[idx],
            'titles': self.titles[idx].to(dev),
            'title_masks': self.title_masks[idx].to(dev),
            'partners': self.partners[idx].to(dev),
            'partner_masks': self.partner_masks[idx].to(dev),
            'paired': self.paired[idx].to(dev)
            }

class ShopeeTestDataset(Dataset):
    def __init__(self, df, tokenizer):
        super().__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.ids = df['posting_id']
        tokens = self.tokenizer(df['title'].to_list(),return_tensors='pt', padding=True, truncation=True, max_length=TOKENMAX)
        self.titles = tokens['input_ids']
        self.title_masks = tokens['attention_mask']
        
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        return {
            'ids': self.ids[idx],
            'titles': self.titles[idx].to(dev),
            'title_masks': self.title_masks[idx].to(dev)
            }
    
class BertUnsupervisedClassifier(nn.Module):
    def __init__(self, internet=True):
        super(BertUnsupervisedClassifier,self).__init__()
        if internet:
            self.base = BertModel.from_pretrained(MODEL_NAME)
        else:
            self.base = BertModel(BertConfig(vocab_size=119547))
        self.d = self.distance(DISTANCE_FUN,2)
        self.q = self.binary(DISTANCE_FUN)
    
    def distance(self,key,*argv):
        if key == 'COS':
            cos = nn.CosineSimilarity(dim = 1)
            return cos
        elif key == 'PDIST':
            return nn.PairwiseDistance(p = argv[0])
    
    def binary(self,key,*argv):
        if key == 'COS':
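            if LOSS_FUN == 'BCE':
                eps = 1e-7
                # Map cosine similarity in [-1, 1] to a probability in (0, 1):
                # similar pairs (cos near 1) must get probability near 1 to
                # agree with the 'paired' target, and BCELoss needs values in [0, 1]
                prob = lambda x: torch.clamp((1 + x) / 2, eps, 1 - eps)
            elif LOSS_FUN == 'MSE':
                prob = lambda x: x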
                
        elif key == 'PDIST':
            prob = lambda x: 1/(x+1)
        
        return prob
    
    def forward(self,x1,m1,x2,m2):
        x1 = self.base(x1,attention_mask=m1,output_hidden_states=False)
        x1 = x1['pooler_output']
        x2 = self.base(x2,attention_mask=m2,output_hidden_states=False)
        x2 = x2['pooler_output']
        
        
        x = self.d(x1,x2)
        x = self.q(x)
        
        return x

def f1_score(row):
    n = len(np.intersect1d(row['matches'],row['target']))
    return 2*n/(len(row['matches'])+len(row['target']))

def match(model, dataset, df, threshold):
    model = model.eval()
    
    index = df.index.values
    df = df.reset_index(drop=True)
    
    
    with torch.no_grad():
        x = model.base(dataset[index]['titles'],attention_mask=dataset[index]['title_masks'],output_hidden_states=False)
        x = x['pooler_output']
        n = len(x)
        matches = [[] for i in range(n)]
        
        for i in range(n):
            matches[i].append(df.loc[i,'posting_id'])
            for j in range(i+1,n):
                q = model.q(model.d(x[i].unsqueeze(0),x[j].unsqueeze(0))).item()
                if q > threshold:
                    matches[i].append(df.loc[j,'posting_id'])
                    matches[j].append(df.loc[i,'posting_id'])

    df['matches'] = matches
    
    return df

def match_cosinesim(model, dataloader, df, threshold):
    model = model.eval()
    
    with torch.no_grad():
        x = torch.tensor([]).to(dev)
        for dataset in dataloader:
            out = model.base(dataset['titles'],attention_mask=dataset['title_masks'],output_hidden_states=False)
            x = torch.cat((x,out['pooler_output']))
        x = F.normalize(x,2,1)
        n = len(x)
        
        n_batches = n // BATCH_SIZE
        if n %BATCH_SIZE != 0:
            n_batches += 1
        
        matches = []
        for b in range(n_batches):
            l = b*BATCH_SIZE
            r = min((b+1)*BATCH_SIZE,n)
            
            y = x[l:r].transpose(0,1)
            qq = torch.matmul(x,y).transpose(0,1)
            qq = qq > threshold
            qq = qq.cpu().numpy()
            for j in range(r-l):
                ind = np.flatnonzero(qq[j])
                matches.append(df.iloc[ind].posting_id.to_list())

    df['matches'] = matches
    
    return df

def data_prep(df,gtsplit):
    
    df['paired'] = False
    trueindices = np.random.permutation(len(df))[:int(gtsplit*len(df))]
    df.loc[trueindices,'paired'] = True
    
    df['partner_id'] = None
    df['partner_index'] = None
    
    for i in tqdm(df.index):
        if df.loc[i,'paired']:
            ingroup = df[df['label_group'] == df.loc[i,'label_group']]
            j = random.choice(ingroup.index.drop(i))
            df.loc[i,'partner_id'] = ingroup.loc[j,'posting_id']
            df.loc[i,'partner_index'] = j
        else:
            notingroup = df[df['label_group'] != df.loc[i,'label_group']]
            j = random.choice(notingroup.index)
            df.loc[i,'partner_id'] = notingroup.loc[j,'posting_id']
            df.loc[i,'partner_index'] = j
    
    return df

def train_fn(model,dataloader,validdf,loss_fn,optimizer,epochs):
    tracker = ScoreTracker(PATIENCE)
    scores = []
    
    n_batches = len(dataloader)
    
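    # ShopeeTestDataset requires a tokenizer as its second argument
    validdataset = ShopeeTestDataset(validdf, BertTokenizer.from_pretrained(MODEL_NAME))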
    
    
    bar = tqdm(range(epochs))
    bar.set_description('Training')
    
    for i in bar:
        model = model.train()
        lossav = 0
        
        for d in dataloader:
            
            optimizer.zero_grad()
            
            titles = d['titles']
            titlemasks = d['title_masks']
            partners = d['partners']
            partner_masks = d['partner_masks']
            target = d['paired']
            
            outputs = model(titles,titlemasks,partners,partner_masks)
            loss = loss_fn(outputs,target)
            loss.backward()
            optimizer.step()
            loss = loss.detach().item()
            lossav += loss
        lossav /= n_batches
        scores.append(lossav)
        
        validdf = match(model,validdataset,validdf,THRESHOLD)
        validdf['f1score'] = validdf.apply(f1_score,axis=1)
        score = validdf['f1score'].mean()
        scores.append(score)
        tracker(score,model,f'{SAVEFOLDER}model.pt')
        bar.write(f"Epoch: {i} Loss: {lossav} Score: {score}")
        
        if tracker.early_stop:
            break
    del outputs
        
    return scores

In [None]:
if DATAPREP:
    train = pd.read_csv(TRAINHEADER)
    train = data_prep(train,GTSPLIT)
    train.to_csv(PREPTRAINHEADER,index=False)



if TRAINING:
    if not DATAPREP:
        train = pd.read_csv(PREPTRAINHEADER)
    
    
    group_dicts = train.groupby('label_group')['posting_id'].unique().to_dict()
    train['target'] = train['label_group'].map(group_dicts)
    labels = train['label_group'].unique().tolist()
    labels = random.sample(labels,VALIDGROUPS)
    
    validset = train[train['label_group'].isin(labels)]
    validset = validset.reset_index(drop=True)
    
    dataset = ShopeeDataset(train)
    
    vN = len(validset)
    
    baseline = [[validset.loc[i,'posting_id']] for i in range(vN) ]
    
    validset['matches'] = baseline
    validset['f1score'] = validset.apply(f1_score,axis=1)
    
    print(f"Baseline score for validationset is {validset['f1score'].mean()}")
    
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)
    if LOSS_FUN == 'MSE':
        loss_fn = nn.MSELoss()
    elif LOSS_FUN == 'BCE':
        loss_fn = nn.BCELoss()

    model = BertUnsupervisedClassifier().to(dev)
    optimizer = AdamW(model.parameters(), lr=LR)
    
    
    results = train_fn(model,dataloader,validset,loss_fn,optimizer,EPOCHS)
        
    del model

if SUBMIT:
    test = pd.read_csv(TESTHEADER)
    
    with open(f'{TOKENIZER}tokenizer.pickle', 'rb') as f:
        tokenizer = pickle.load(f)
    
    testdataset = ShopeeTestDataset(test,tokenizer)
    dataloader = DataLoader(testdataset, batch_size=BATCH_SIZE)
    
    model = BertUnsupervisedClassifier(internet = False).to(dev)
    model.eval()
    model.load_state_dict(torch.load(f'{SAVEFOLDER}model.pt'))
    
    
    test = match_cosinesim(model,dataloader,test,THRESHOLD)
    
    # Join each list of matched posting_ids into one space-separated string
    matches = [' '.join(m) for m in test['matches']]
    
    submission = pd.DataFrame({'posting_id': test['posting_id'], 'matches': matches})
    
    
    
    submission.to_csv(f'{SUBMISSIONFOLDER}submission.csv',index=False)