**Bidirectional Encoder Representations from Transformers (BERT)**

This notebook documents my first experience with BERT.
<br><br>

Articles that helped me understand the topic:
<br>

[Visualizing Neural Machine Translation (seq2seq Models with Attention)](https://habr.com/ru/post/486158/) (in Russian)

[Transformer in Pictures](https://habr.com/ru/post/486358/) (in Russian)

[BERT, ELMo and Co. in Pictures (How Transfer Learning Came to NLP)](https://habr.com/ru/post/487358/) (in Russian)

[Your First BERT: An Illustrated Guide](https://habr.com/ru/post/498144/) (in Russian)

[A Visual Notebook to Using BERT for the First Time.ipynb](https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)
<br><br>
The code is mostly left without explanations or comments, since everything is covered in detail in the articles above.
<br><br>





In [None]:
# test run to check that the notebook works
print('test')

test


In [None]:
# install transformers
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1


In [None]:
# import libraries

import numpy as np
import pandas as pd
import torch
import transformers as ppb

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
# download the dataset
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [None]:
# take a subset of the data to save time
batch_1 = df[:2000]

In [None]:
# check that the data looks fine
print(batch_1[:5])

                                                   0  1
0  a stirring , funny and finally transporting re...  1
1  apparently reassembled from the cutting room f...  0
2  they presume their audience wo n't sit still f...  0
3  this is a visually stunning rumination on love...  1
4  jonathan parker 's bartleby should have been t...  1


In [None]:
# Load the pretrained model and tokenizer
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# tokenize each sentence into a list of token ids
tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [None]:
# pad the token lists with zeros to a common length and stack them into an array

max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

In [None]:
# mask the padding so that the model ignores it
attention_mask = np.where(padded != 0, 1, 0)
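
To make the padding and masking steps concrete, here is a minimal sketch on toy data. The token ids and the `toy_` names are made up for illustration, and it assumes, as the cells above do, that id 0 is the padding token.

```python
import numpy as np

# Toy token-id lists of different lengths (illustrative ids, not taken from the dataset)
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]

# Pad every list with zeros up to the length of the longest one
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding (assumes no real token has id 0)
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

Each row of the mask has ones exactly where the matching row of the padded array holds a real token.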

In [None]:
# build the input tensors from the token matrix and run them through the model

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [None]:
# slice the 3D tensor to get the 2D tensor we need: the position-0 ([CLS]) vector of each sentence
features = last_hidden_states[0][:,0,:].numpy()
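
The slice `[:, 0, :]` can be sketched on a random stand-in array. The shapes here are illustrative assumptions: 4 sentences, 10 token positions, and a hidden size of 768, which is the hidden size of `distilbert-base-uncased`.

```python
import numpy as np

# Stand-in for the model output: (batch size, sequence length, hidden size)
toy_hidden = np.random.rand(4, 10, 768)

# Keep all sentences, only token position 0, and all hidden units
toy_cls = toy_hidden[:, 0, :]

print(toy_hidden.shape)
print(toy_cls.shape)
```

The result has one row per sentence, which is the 2D layout that scikit-learn classifiers expect.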

In [None]:
# split the data into training and test sets
labels = batch_1[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
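
Because `train_test_split` shuffles randomly, the metrics below can change from run to run. A minimal sketch of a reproducible split, assuming stand-in data with the same number of rows as `batch_1`; `random_state=42` and `stratify` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 2000 rows with arbitrary values and balanced 0/1 labels
toy_X = np.arange(2000 * 3).reshape(2000, 3)
toy_y = np.arange(2000) % 2

# random_state fixes the shuffle; stratify keeps the class balance in both parts
split_a = train_test_split(toy_X, toy_y, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, random_state=42, stratify=toy_y)

train_X, test_X, train_y, test_y = split_a
print(train_X.shape, test_X.shape)  # the default test_size is 0.25
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```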

In [None]:
# train a logistic regression model on the training set
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression()

In [None]:
# the model is trained; check the accuracy
print('# train:', lr_clf.score(train_features, train_labels))
print('# test:', lr_clf.score(test_features, test_labels))

# train: 0.906
# test: 0.844
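
`GridSearchCV` and `cross_val_score` are imported above but not used. A hedged sketch of how they could tune the regularization strength `C`, run on synthetic data from `make_classification` rather than on the BERT features; the grid values and `max_iter=1000` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for (features, labels)
toy_X, toy_y = make_classification(n_samples=300, n_features=20, random_state=0)

# Try several regularization strengths with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(toy_X, toy_y)
best_c = search.best_params_['C']
print('best C:', best_c)

# Estimate accuracy of the chosen setting with cross-validation
cv_scores = cross_val_score(LogisticRegression(C=best_c, max_iter=1000), toy_X, toy_y, cv=5)
print('mean CV accuracy:', cv_scores.mean())
```

With the real data, `toy_X` and `toy_y` would be replaced by `train_features` and `train_labels`.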



**My own part**

Let's check that everything is correct.

In [None]:
# print a fragment of the dataset
print(batch_1[:5])

                                                   0  1
0  a stirring , funny and finally transporting re...  1
1  apparently reassembled from the cutting room f...  0
2  they presume their audience wo n't sit still f...  0
3  this is a visually stunning rumination on love...  1
4  jonathan parker 's bartleby should have been t...  1


In [None]:
# compute the predictions ourselves and compare with the labels

# a linear score above 0 corresponds to a predicted probability above 0.5
def LR(x): return 1 if x > 0 else 0
def correct(pred, label): return bool(pred == label)

n_correct = 0
N = 5

for i in range(N):
  if i and i % 100 == 0: print(i)  # progress indicator for large N
  array = np.array(tokenized[i])
  array = np.pad(array, (0, max_len - len(array)), 'constant')
  attention_mask_this = np.where(array != 0, 1, 0)
  # add a batch dimension to both tensors: shape (1, max_len)
  input_ids_this = torch.tensor(array).unsqueeze(0)
  attention_mask_this = torch.tensor(attention_mask_this).unsqueeze(0)
  with torch.no_grad():
      last_hidden_states_this = model(input_ids_this, attention_mask=attention_mask_this)
  features_this = last_hidden_states_this[0][:,0,:].numpy()

  # linear score without the intercept (lr_clf.intercept_), so it can differ
  # slightly from lr_clf.decision_function(features_this)
  score = np.dot(features_this, lr_clf.coef_[0])

  if N < 20: print(i, labels[i], LR(score), correct(LR(score), labels[i]), score)
  if correct(LR(score), labels[i]): n_correct += 1

print()
print(n_correct / N)

0 1 1 True [3.85857511]
1 0 0 True [-6.51904703]
2 0 0 True [-2.21694384]
3 1 1 True [3.65850942]
4 1 0 False [-0.32528463]

0.8
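
The manual score above uses only `coef_`. scikit-learn's own decision function also adds `intercept_`. A minimal sketch on synthetic data (not the BERT features) showing that the dot product plus the intercept reproduces `decision_function` and `predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (features, labels)
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)
toy_clf = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

# Linear score: weights times features, plus the intercept
manual_scores = toy_X @ toy_clf.coef_[0] + toy_clf.intercept_[0]
manual_preds = (manual_scores > 0).astype(int)

scores_match = np.allclose(manual_scores, toy_clf.decision_function(toy_X))
preds_match = np.array_equal(manual_preds, toy_clf.predict(toy_X))
print(scores_match, preds_match)
```

When the intercept is small compared with the scores, leaving it out rarely flips a prediction, but samples with scores close to zero can be affected.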


Working with our own text fragments
<br><br>

In [None]:
# define our own fragments

texts = [
    'All is good', 
    'it is so bad', 
    'nice to meet you',
    'it is so rainy',
    'he is a stupid',
    'I like my car',
    'have a nice day'    
    ]

In [None]:
# Classify our own fragments

for text in texts:
  array = tokenizer.encode(text, add_special_tokens=True)
  array = np.pad(array, (0, max_len - len(array)), 'constant')
  attention_mask_this = np.where(array != 0, 1, 0)
  # add a batch dimension to both tensors: shape (1, max_len)
  input_ids_this = torch.tensor(array).unsqueeze(0)
  attention_mask_this = torch.tensor(attention_mask_this).unsqueeze(0)
  with torch.no_grad():
      last_hidden_states_this = model(input_ids_this, attention_mask=attention_mask_this)
  features_this = last_hidden_states_this[0][:,0,:].numpy()

  # linear score without the intercept (lr_clf.intercept_)
  score = np.dot(features_this, lr_clf.coef_[0])

  print(text, ':', score, ':', LR(score))

All is good : [1.90097556] : 1
it is so bad : [-1.86999172] : 0
nice to meet you : [4.96587936] : 1
it is so rainy : [-2.53429642] : 0
he is a stupid : [-2.4318853] : 0
I like my car : [0.2118134] : 1
have a nice day : [2.30864912] : 1
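
The raw scores are easier to read as probabilities. A small sketch that applies the logistic (sigmoid) function to the scores printed above; since those scores were computed without the intercept, the results only approximate what `lr_clf.predict_proba` would return.

```python
import numpy as np

texts = ['All is good', 'it is so bad', 'nice to meet you', 'it is so rainy',
         'he is a stupid', 'I like my car', 'have a nice day']

# Scores copied from the output above
scores = np.array([1.90097556, -1.86999172, 4.96587936, -2.53429642,
                   -2.4318853, 0.2118134, 2.30864912])

# Logistic function: maps a linear score to a value between 0 and 1
probs = 1.0 / (1.0 + np.exp(-scores))
pred_labels = (probs > 0.5).astype(int)

for text, p, label in zip(texts, probs, pred_labels):
    print(f'{text}: p(positive) ~ {p:.3f} -> {label}')
```

A score near zero, as for 'I like my car', gives a probability close to 0.5, so that prediction is the least certain of the seven.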


In [None]:
# That's all :)