<a href="https://colab.research.google.com/github/respect5716/deep-learning-paper-implementation/blob/main/03_NLP/Sentence_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence-T5

## 0. Info

### Paper
* title: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
* author: Jianmo Ni et al.
* url: https://arxiv.org/abs/2108.08877

### Features
* pre-trained T5-small encoder with mean pooling (not fine-tuned)

### Reference
* https://github.com/facebookresearch/SentEval
* https://github.com/princeton-nlp/SimCSE

## 1. Setup

In [1]:
!git clone https://github.com/facebookresearch/SentEval.git

Cloning into 'SentEval'...
remote: Enumerating objects: 691, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 691 (delta 0), reused 0 (delta 0), pack-reused 689
Receiving objects: 100% (691/691), 33.25 MiB | 3.93 MiB/s, done.
Resolving deltas: 100% (434/434), done.


In [None]:
!pip install -q transformers

In [2]:
import sys
sys.path.append('SentEval')

In [13]:
import easydict
import senteval
import numpy as np

import torch
from transformers import AutoTokenizer, AutoModel

In [14]:
cfg = easydict.EasyDict(
    model_name_or_path = 't5-small',
    max_length = 128,
)

## 2. Data

In [12]:
%cd SentEval/data/downstream/
!bash get_transfer_data.bash
%cd ../../../

/content/SentEval/data/downstream
Cloning Moses github repository (for tokenization scripts)...
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 148090, done.
remote: Counting objects: 100% (518/518), done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 148090 (delta 319), reused 443 (delta 292), pack-reused 147572
Receiving objects: 100% (148090/148090), 129.87 MiB | 14.39 MiB/s, done.
Resolving deltas: 100% (114345/114345), done.
mkdir: cannot create directory ‘.’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3716k  100 3716k    0     0  11.5M      0 --:--:-- --:--:-- --:--:-- 11.5M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 94916  100 94916    0     0   475k      0 --:--:-- --:--:-- --:--:--  475k
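
Before running the evaluation it can help to confirm that the download actually produced the task folders. The sketch below is a hypothetical helper, not part of SentEval; the relative paths in `EXPECTED_DIRS` are an assumption about where SentEval looks for the three tasks used in this notebook and may need adjusting for a different checkout.

```python
import os

# Assumed layout (not confirmed by this notebook): sub-directories under
# SentEval/data that the MR, STS12 and STSBenchmark tasks are expected to read.
EXPECTED_DIRS = [
    "downstream/MR",
    "downstream/STS/STS12-en-test",
    "downstream/STS/STSBenchmark",
]

def missing_task_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under data_root."""
    return [d for d in expected if not os.path.isdir(os.path.join(data_root, d))]

# An empty list means every expected folder was found.
print("missing:", missing_task_dirs("SentEval/data"))
```

If the list is not empty, rerunning `get_transfer_data.bash` from inside `SentEval/data/downstream/` is the first thing to try.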
 

## 3. Model

In [15]:
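def mean_pooling(token_embeddings, attention_mask):
    # Broadcast the (batch, seq) mask to (batch, seq, hidden) so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token vectors and divide by the number of real tokens (clamped to avoid 0/0).
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def prepare(params, samples):
    # SentEval requires this hook; nothing needs to be precomputed here.
    pass

def batcher(params, batch):
    # SentEval passes each sentence as a list of tokens; join them back into strings.
    batch = [' '.join(i) for i in batch]
    inputs = tokenizer(batch, max_length=cfg.max_length, padding='max_length', truncation=True, return_tensors='pt')
    inputs = inputs.to(model.device)
    with torch.no_grad():
        # Only the T5 encoder is used; the decoder is ignored.
        outputs = model.encoder(**inputs).last_hidden_state
    embeds = mean_pooling(outputs, inputs.attention_mask).cpu().numpy()
    return embeds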
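
To see why `mean_pooling` needs the attention mask, here is a minimal pure-Python sketch of the same idea for a single sentence. `masked_mean` is a hypothetical stand-in written with lists instead of tensors, and the numbers are made up for illustration.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Same role as torch.clamp(..., min=1e-9): avoid dividing by zero
    # when every position is padding.
    denom = max(count, 1e-9)
    return [t / denom for t in total]

# Two real tokens followed by one padding position with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Because `batcher` pads every sentence to `max_length`, most positions are padding for short inputs, so an unmasked mean would be dominated by them.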

In [16]:
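tokenizer = AutoTokenizer.from_pretrained(cfg.model_name_or_path)
model = AutoModel.from_pretrained(cfg.model_name_or_path)
# Fall back to CPU when no GPU runtime is attached.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
_ = model.eval().to(device)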

In [17]:
params = {'task_path': 'SentEval/data/', 'seed': 42, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64, 'tenacity': 5, 'epoch_size': 4}

In [18]:
tasks = ['MR', 'STS12', 'STSBenchmark']

total_results, results = {}, {}
for task in tasks:
    se = senteval.engine.SE(params, batcher, prepare)
    res = se.eval(task)

    # Each task family reports its headline metric under a different key.
    if task in ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC']:
        score = round(res['acc'], 2)
    elif task in ['STS12', 'STS13', 'STS14', 'STS15', 'STS16']:
        score = round(res['all']['spearman']['mean'] * 100, 2)
    elif task in ['STSBenchmark', 'SICKRelatedness']:
        score = round(res['spearman'] * 100, 2)
    else:
        # Without this branch an unknown task would silently reuse the previous score.
        raise ValueError(f'no scoring rule for task: {task}')

    results[task] = score
    total_results[task] = res
    print(f'{task} | {score}')

MR | 79.58
STS12 | 50.68
STSBenchmark | 69.49


In [19]:
results

{'MR': 79.58, 'STS12': 50.68, 'STSBenchmark': 69.49}
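
The STS scores above are Spearman correlations scaled by 100. As a rough illustration of what that number measures, the sketch below scores toy sentence pairs by cosine similarity and correlates the ranking with gold labels. It assumes the STS12-style setup, where (as far as can be told from SentEval's code) pairs are compared by cosine similarity with no training; the embeddings and gold scores are invented, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy embedding pairs (made up) and human similarity judgements on a 0-5 scale.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 3.0, 0.5]
predicted = [cosine(a, b) for a, b in pairs]
print(round(spearman(predicted, gold) * 100, 2))
```

Only the ordering of the similarities matters for Spearman, so an encoder can score well here even if its raw cosine values are poorly calibrated.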