<a href="https://colab.research.google.com/github/yinhao0424/reuters/blob/master/ReusterSiameseNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



## Few Shot Learner
Sentence embeddings are generated by adding a pooling layer on top of a BERT model (DistilBERT in this notebook). The network is trained with a triplet loss.  

The one-shot learner has been tested on the "commodity" category. The evaluation has two parts: comparing each query with the support set without fine-tuning, and training a fine-tuned classifier on the support set.
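
As a rough illustration of what "pooling" means here, the sketch below reduces a sequence of token vectors to one sentence vector in two common ways. The toy vectors and both function names are made up for this example. The model defined later in this notebook takes the first-token route (`hidden_state[:, 0]`); masked mean pooling is shown only as a point of comparison.

```python
# Illustrative only: toy token vectors stand in for a transformer's last hidden state.

def first_token_pool(hidden_states):
    """Use the vector at position 0 (the [CLS] slot) as the sentence embedding."""
    return list(hidden_states[0])


def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention mask is 1, ignoring padding."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three real tokens followed by one padding position (hypothetical values).
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_embedding = first_token_pool(hidden)
mean_embedding = masked_mean_pool(hidden, mask)
print(cls_embedding)   # [1.0, 0.0]
print(mean_embedding)  # [2.0, 2.0]
```

The padding row is excluded from the mean, which is why the attention mask has to travel with the token ids.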

***
1/2/2021
- Build dataloader
- Construct triplet loss

1/4/2021
- Generate support and testing set
- Work on existing package sentence embedding
- Build siamese NN

1/5/2021
- Debug the siamese NN
- Review few-shot learning

1/6/2021  
Test the model performance
- Test without Finetuning
  - store the embedding of support set
  - calculate the embedding of query
  - find the sample with the highest similarity score
- Finetuning
***
Reference:
- Paper/Blog
  - [BERT word embedding](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#31-running-bert-on-our-text)
  - [Sentence Embeddings using Siamese BERT-Networks - paper](https://www.aclweb.org/anthology/D19-1410.pdf)
  - [Sentence Embeddings using Siamese BERT-Networks - colab](https://github.com/aneesha/SiameseBERT-Notebook/blob/master/SiameseBERT_SemanticSearch.ipynb)
- Discussion
  - [Generate sequence classifier](https://github.com/huggingface/transformers/issues/1001)
  - [Sequence Classification pooled output vs last hidden state](https://github.com/huggingface/transformers/issues/1328)
- Github
  - [triplet-network-pytorch](https://github.com/andreasveit/triplet-network-pytorch/blob/master/train.py)
  - [One Shot Learning with Siamese Networks](https://github.com/harveyslash/Facial-Similarity-with-Siamese-Networks-in-Pytorch/blob/master/Siamese-networks-medium.ipynb)



In [None]:
# a specific version of transformers is pinned
! pip install -q transformers==3.0.2
# !pip install -q transformers

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel

import warnings
warnings.simplefilter('ignore')
import logging
logging.basicConfig(level=logging.ERROR)

In [None]:
if torch.cuda.is_available():
  device = torch.device("cuda")
else:
  device = torch.device("cpu")
device

device(type='cuda')

In [None]:
# config.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [None]:
# Sections of config
# Defining some key variables that will be used later on in the training
MAX_LEN = 200
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 2
LEARNING_RATE = 1e-05
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)

In [None]:
reuster_train = pd.read_csv('/content/drive/MyDrive/data/reuters/reuster_fewshot_train.csv')
reuster_train.head()

Unnamed: 0,id,topics,texts
0,4016,iron-steel,"usx <x> proved oil, gas reserves fall in 1986u..."
1,4022,carcass,argentine meat exports higher in jan/feb 1987a...
2,4022,livestock,argentine meat exports higher in jan/feb 1987a...
3,4035,veg-oil,british minister criticises proposed ec oils t...
4,4040,oilseed,china's rapeseed crop damaged by stormsthe yie...


In [None]:
reuster_train.shape

(1143, 3)

In [None]:
class SiameseDataset(Dataset):
    """
        Input: a dataframe
        output: anchor, positive and negative
    """
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.texts
        self.topics = self.data.topics
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        anchor = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = anchor['input_ids']
        mask = anchor['attention_mask']
        # token_type_ids = anchor["token_type_ids"]


        return {'anchor':{
            'index':torch.tensor(index, dtype=torch.int),
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long)},
        'positive': self.get_positive(index),
        'negative': self.get_negative(index)
        }

    def get_positive(self, index):
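        # the topic of the anchor
        topic = self.topics[index]

        # select a positive sample: same topic as the anchor, but a different row
        candidates = self.topics[self.topics==topic].index
        candidates = candidates[candidates != index]
        # fall back to the anchor itself if its topic has no other sample
        # (the original while-loop would never terminate in that case)
        p_idx = np.random.choice(candidates) if len(candidates) > 0 else index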
        
        text = str(self.text[p_idx])
        text = " ".join(text.split())

        positive = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = positive['input_ids']
        mask = positive['attention_mask']
        # token_type_ids = positive["token_type_ids"]

        return {
            'index':torch.tensor(p_idx, dtype=torch.int),
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long)}

    def get_negative(self, index):
        # the topic of the anchor
        topic = self.topics[index]

        # select a negative sample: any row whose topic differs from the anchor's
        candidates = self.topics[self.topics!=topic].index
        n_idx = np.random.choice(candidates)
        
        text = str(self.text[n_idx])
        text = " ".join(text.split())

        negative = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = negative['input_ids']
        mask = negative['attention_mask']
        # token_type_ids = negative["token_type_ids"]

        return {
            'index':torch.tensor(n_idx, dtype=torch.int),
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long)}
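
The sampling logic in `SiameseDataset` can be looked at on its own, without the tokenizer. The following is a minimal sketch using only the standard library; `sample_triplet` and the toy topic list are hypothetical names and data for illustration, not part of the notebook. It assumes at least two distinct topics exist.

```python
import random


def sample_triplet(topics, anchor_idx, rng=random):
    """Return (anchor, positive, negative) row indices for one anchor."""
    anchor_topic = topics[anchor_idx]
    # Positives share the anchor's topic but are a different row.
    positives = [i for i, t in enumerate(topics)
                 if t == anchor_topic and i != anchor_idx]
    # Negatives are any row with a different topic.
    negatives = [i for i, t in enumerate(topics) if t != anchor_topic]
    if not negatives:
        raise ValueError("need at least two distinct topics")
    # If the topic has a single example, reuse the anchor as its own positive.
    positive_idx = rng.choice(positives) if positives else anchor_idx
    return anchor_idx, positive_idx, rng.choice(negatives)


# Hypothetical labels, loosely modelled on the topics column shown earlier.
topics = ["veg-oil", "oilseed", "veg-oil", "carcass", "oilseed"]
a, p, n = sample_triplet(topics, 0, random.Random(0))
print(a, p, topics[n])
```

Because triplets are drawn at random on every access, the same anchor is usually paired with different positives and negatives in each epoch.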

In [None]:
print("TRAIN Dataset: {}".format(reuster_train.shape))

training_set = SiameseDataset(reuster_train, tokenizer, MAX_LEN)

TRAIN Dataset: (1143, 3)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 1
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 1
                }

training_loader = DataLoader(training_set, **train_params)

In [None]:
print("The len of training loader is {}.".format(len(training_loader)))

The len of training loader is 143.
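
The loader length follows from the dataset size and batch size. A small sketch of the arithmetic, assuming the `DataLoader` default of keeping the final partial batch (`drop_last=False`); `num_batches` is a helper written for this example:

```python
import math


def num_batches(num_rows, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of num_rows samples."""
    if drop_last:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)


print(num_batches(1143, 8))                  # 143, matching the output above
print(num_batches(1143, 8, drop_last=True))  # 142
print(1143 - 142 * 8)                        # 7 samples in the final batch
```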


## Create the Neural Network

In [None]:
class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(768, 256)

    def forward(self, data):
        input_ids = data['ids']
        attention_mask = data['mask']

        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.Tanh()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

    # def forward(self, anchor,positive,negative):
    #     res_anchor = self.forward_once(anchor)
    #     res_positive = self.forward_once(positive)
    #     res_negative = self.forward_once(negative)
    #     return res_anchor,res_positive,res_negative

model = DistilBERTClass()
model.to(device)

DistilBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_featu

In [None]:
def triplet_loss(anchor, positive, negative):
  loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)
  return loss(anchor, positive, negative)
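
For a single triplet, the triplet margin loss with Euclidean distance is `max(d(anchor, positive) - d(anchor, negative) + margin, 0)`. The sketch below works this out by hand in plain Python on made-up 2-D points; it is only meant to show the arithmetic and is not a replacement for `torch.nn.TripletMarginLoss`, which also averages over the batch by default.

```python
import math


def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Loss for one triplet: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
# Positive is close (distance 1), negative is far (distance 5): 1 - 5 + 1 < 0.
easy = triplet_margin_loss(anchor, [0.0, 1.0], [3.0, 4.0])
# Roles swapped: 5 - 1 + 1 = 5.
hard = triplet_margin_loss(anchor, [3.0, 4.0], [0.0, 1.0])
print(easy, hard)  # 0.0 5.0
```

A batch loss of exactly 0.0 is what this formula gives when every triplet in the batch already clears the margin, so such a batch contributes no gradient.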

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [None]:

def train(epoch):
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        # move each tensor to the selected device instead of hard-coding .cuda()
        anchor = {key: data['anchor'][key].to(device) for key in data['anchor']}
        positive = {key: data['positive'][key].to(device) for key in data['positive']}
        negative = {key: data['negative'][key].to(device) for key in data['negative']}
        res_anchor, res_positive, res_negative = model(anchor), model(positive), model(negative)

        loss = triplet_loss(res_anchor, res_positive, res_negative)
        if step % 20 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # clear stale gradients once, then backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [None]:

for epoch in range(EPOCHS):
    train(epoch)

0it [00:00, ?it/s]

Epoch: 0, Loss:  0.949222207069397


20it [00:12,  1.64it/s]

Epoch: 0, Loss:  0.6064987182617188


40it [00:24,  1.67it/s]

Epoch: 0, Loss:  0.7553625106811523


60it [00:36,  1.66it/s]

Epoch: 0, Loss:  0.3175418972969055


80it [00:48,  1.62it/s]

Epoch: 0, Loss:  0.2462424635887146


100it [01:01,  1.64it/s]

Epoch: 0, Loss:  0.5698609352111816


120it [01:13,  1.65it/s]

Epoch: 0, Loss:  0.3806683123111725


140it [01:25,  1.67it/s]

Epoch: 0, Loss:  0.24013476073741913


143it [01:27,  1.64it/s]
0it [00:00, ?it/s]

Epoch: 1, Loss:  0.03964850306510925


20it [00:11,  1.67it/s]

Epoch: 1, Loss:  0.0872943103313446


40it [00:23,  1.65it/s]

Epoch: 1, Loss:  0.13975036144256592


60it [00:36,  1.64it/s]

Epoch: 1, Loss:  0.8357698917388916


80it [00:48,  1.66it/s]

Epoch: 1, Loss:  0.3819858729839325


100it [01:00,  1.68it/s]

Epoch: 1, Loss:  0.1067567765712738


120it [01:12,  1.64it/s]

Epoch: 1, Loss:  0.14656051993370056


140it [01:24,  1.67it/s]

Epoch: 1, Loss:  0.0


143it [01:25,  1.67it/s]


In [None]:
# save model 
PATH = '/content/drive/MyDrive/data/reuters/siamese_NN.pth'

torch.save(model.state_dict(), PATH)
print('Saved')


Saved


## Test similarity

In [None]:
reuster_support = pd.read_csv('/content/drive/MyDrive/data/reuters/fewshot_support.csv')
reuster_test = pd.read_csv('/content/drive/MyDrive/data/reuters/fewshot_test.csv')
reuster_support.head()

In [None]:
class OneShotLearning(Dataset):
    """
        Input: a dataframe
        output: index, ids, mask
    """
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.texts
        self.topics = self.data.topics
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        anchor = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = anchor['input_ids']
        mask = anchor['attention_mask']
        # token_type_ids = anchor["token_type_ids"]


        return {
            'index':torch.tensor(index, dtype=torch.int),
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long)
        }

In [None]:
print("Support Dataset: {}".format(reuster_support.shape))
print("Test Dataset: {}".format(reuster_test.shape))


support_set = OneShotLearning(reuster_support, tokenizer, MAX_LEN)
testing_set = OneShotLearning(reuster_test, tokenizer, MAX_LEN)

In [None]:
TRAIN_BATCH_SIZE = 8
SUPPORT_BATCH_SIZE = 1
TEST_BATCH_SIZE = 1

support_params = {'batch_size': SUPPORT_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': TEST_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

support_loader = DataLoader(support_set, **support_params)
testing_loader = DataLoader(testing_set, **test_params)

In [None]:
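# generate support embeddings
support_res = []
support_idx = []
model.eval()  # disable dropout so the embeddings are deterministic
with torch.no_grad():  # no gradients are needed at inference time
    for _, data in tqdm(enumerate(support_loader, 0)):
        support_idx.append(data['index'].tolist()[0])
        support = {key: data[key].to(device) for key in data}

        support_res.append(model(support))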

In [None]:
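import torch.nn.functional as F

def similar_support(testing_res):
  # start below any possible cosine value so a match is always returned
  most_similar = float('-inf')
  most_similar_idx = None
  for idx, support in enumerate(support_res):
    # both tensors have shape (1, 256); .item() turns the result into a float
    out = F.cosine_similarity(support, testing_res).item()
    if out > most_similar:
      most_similar = out
      most_similar_idx = idx
  return most_similar_idx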
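
The lookup can be checked on small hand-made vectors before running it on real embeddings. This is a standard-library sketch with hypothetical vectors and topic names; `cosine` and `nearest_support` are written for this example. Starting the running maximum at negative infinity means a match is returned even when every similarity is negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_support(query, support_vectors):
    """Return (index, score) of the support vector most similar to the query."""
    best_idx, best_score = None, -math.inf
    for idx, vec in enumerate(support_vectors):
        score = cosine(vec, query)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Hypothetical 2-D "embeddings", one per support example.
support_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_topics = ["veg-oil", "oilseed", "carcass"]

idx, score = nearest_support([0.1, 0.9], support_vectors)
predicted = support_topics[idx]
print(predicted)  # oilseed

# Every similarity is negative here, yet an index is still returned.
neg_idx, neg_score = nearest_support([-1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]])
print(neg_idx, neg_score < 0)  # 0 True
```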

In [None]:
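# Embed the test queries like the support set (testing_res / testing_idx were not defined earlier)
testing_res, testing_idx = [], []
model.eval()
with torch.no_grad():
  for data in testing_loader:
    testing_idx.append(data['index'].tolist()[0])
    testing_res.append(model({key: data[key].to(device) for key in data}))

true_positive = 0
for index, test in enumerate(testing_res):
  if index == 10:
    break
  most_similar_idx = similar_support(test)
  support_topic = reuster_support.iloc[support_idx[most_similar_idx]]['topics']
  test_topic = reuster_test.iloc[testing_idx[index]]['topics']
  if support_topic == test_topic:
    true_positive+=1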