# Using BERT for the First Time

Source: [Jay Alammar](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)


<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

## Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes the information it extracted on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at Hugging Face. It's a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, takes in the result of DistilBERT's processing and classifies the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the transformers library
Let's start by installing the Hugging Face transformers library so we can load our deep learning NLP model.

In [1]:
# !pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/10/aeefced99c8a59d828a92cc11d213e2743212d3641c87c82d61b035a7d5c/transformers-2.3.0-py3-none-any.whl (447kB)
Collecting sentencepiece (from transformers)
  Downloading https://files.pythonhosted.org/packages/e6/56/2e6cfc364c4760b85adab40cb38d91e7ce67d6b2745a2e1aa1497c776fe1/sentencepiece-0.1.85-cp37-cp37m-macosx_10_6_x86_64.whl (1.1MB)
Collecting sacremoses (from transformers)
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Stored in directory: /Users/aleksandr/

In [6]:
# !pip install torchvision

Collecting torchvision
  Downloading https://files.pythonhosted.org/packages/fa/71/0e76ba50c8c9aeb8349d827d02278c1b5eb4da9cdc17ca26b5bd47ec034a/torchvision-0.4.2-cp37-cp37m-macosx_10_7_x86_64.whl (641kB)
Collecting torch==1.3.1 (from torchvision)
  Downloading https://files.pythonhosted.org/packages/7e/94/0ed9f7899aa0f5e7ff753a3a2b6944c146eef3f4cd51c59ab07c4575992b/torch-1.3.1-cp37-none-macosx_10_7_x86_64.whl (71.1MB)
Installing collected packages: torch, torchvision
Successfully installed torch-1.3.1 torchvision-0.4.2


In [16]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Data

In [8]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [9]:
df.head()

                                                   0  1
0  a stirring , funny and finally transporting re...  1
1  apparently reassembled from the cutting room f...  0
2  they presume their audience wo n't sit still f...  0
3  this is a visually stunning rumination on love...  1
4  jonathan parker 's bartleby should have been t...  1


We take the first 2,000 sentences.

In [10]:
batch_1 = df[:2000]

Class balance:

In [11]:
batch_1[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [14]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [18]:
# tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased')

Right now, the variable `model` holds a pretrained distilBERT model, a smaller version of BERT that is much faster and requires far less memory.

### Tokenization
Our first step is to tokenize the sentences, that is, to break them up into words and subwords in the format BERT is comfortable with.

In [None]:
# text1 = batch_1[0].iloc[0]
# display(text1)

# text1_tokenised = tokenizer.encode(text1, add_special_tokens=True)
# for word_id in text1_tokenised:
#     print(tokenizer.ids_to_tokens[word_id])


# tokenizer.vocab['film']
# tokenizer.ids_to_tokens[2143]

In [19]:
tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />
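To make the split into words and subwords concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the scheme WordPiece-style tokenizers such as BERT's use. The tiny vocabulary and its ids are made up for illustration and are not DistilBERT's real vocabulary; the helper names `wordpiece` and `encode` are hypothetical, and the real tokenizer also handles punctuation and normalization.

```python
# Toy vocabulary (hypothetical; the real one has roughly 30k entries).
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "visually": 5, "stunning": 6, "rum": 7,
             "##ination": 8, "on": 9, "love": 10}


def wordpiece(word, vocab):
    """Split one word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no match: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence, vocab):
    """Whitespace-split, subword-split, then add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("a visually stunning rumination on love", TOY_VOCAB)
print(tokens)
print(ids)
```

In this toy run, "rumination" is not in the vocabulary as a whole word, so it comes out as `rum` followed by the continuation piece `##ination`.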

### Padding
We pad the sentences to the same length with zero tokens (token id 0).

In [20]:
# Length of the longest tokenized sentence
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [21]:
np.array(padded).shape

(2000, 59)

### Masking

Now we create a separate variable that tells BERT to ignore the padding when computing attention.

In [22]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)

In [23]:
attention_mask[0]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [24]:
len(tokenized[0])

20
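To see what ignoring the padding means in practice, the sketch below applies a mask to toy attention scores before a softmax. It is a simplified NumPy illustration, not DistilBERT's actual implementation: the token ids are illustrative, the scores are random stand-ins, and replacing masked scores with a large negative constant is just one common way to drive their weights to zero.

```python
import numpy as np

# One toy padded sentence: 4 real tokens followed by 2 pad tokens (id 0).
toy_ids = np.array([101, 2023, 2143, 102, 0, 0])
toy_mask = np.where(toy_ids != 0, 1, 0)  # 1 = real token, 0 = padding

# Random stand-in for the raw attention scores of one query token.
rng = np.random.default_rng(0)
scores = rng.normal(size=toy_ids.shape)

# Push masked positions towards -inf so softmax gives them zero weight.
masked_scores = np.where(toy_mask == 1, scores, -1e9)
weights = np.exp(masked_scores - masked_scores.max())
weights = weights / weights.sum()

print(toy_mask)          # [1 1 1 1 0 0]
print(weights.round(3))  # the last two weights are 0
```

The real tokens share all of the attention weight, while the padded positions contribute nothing to the weighted sum.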

## Running BERT

Calling `model()` runs the sentences through BERT.

In [25]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [26]:
last_hidden_states[0].size()

torch.Size([2000, 59, 768])

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. BERT does sentence classification by adding a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

From that output we take only the representation of the first token, `[CLS]`. This representation will serve as our features.

In [27]:
features = last_hidden_states[0][:,0,:].numpy()

In [28]:
len(features[0])

768
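The indexing `[:, 0, :]` can be checked on a small stand-in array. The sketch below uses NumPy with made-up sizes (4 sentences, 5 tokens, 8 hidden units) instead of the real `(2000, 59, 768)` tensor; the names `toy_hidden` and `toy_cls` are hypothetical.

```python
import numpy as np

# Stand-in for the model output: (sentences, tokens, hidden units).
toy_hidden = np.arange(4 * 5 * 8).reshape(4, 5, 8)

# All sentences, token position 0 only ([CLS]), all hidden units.
toy_cls = toy_hidden[:, 0, :]

print(toy_cls.shape)  # (4, 8)

# Row i of toy_cls is the first-token vector of sentence i.
print((toy_cls[2] == toy_hidden[2, 0]).all())  # True
```

The token axis disappears, leaving one vector per sentence, which is exactly the 2D feature matrix a scikit-learn classifier expects.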

Labels:

In [29]:
labels = batch_1[1]

In [33]:
labels[:10]

0    1
1    0
2    0
3    1
4    1
5    1
6    0
7    1
8    0
9    0
Name: 1, dtype: int64

## Logistic Regression on BERT Features
Let's now split our dataset into a training set and a testing set (even though we're using 2,000 sentences from the SST2 training set).

In [34]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
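Called without extra arguments, `train_test_split` shuffles randomly and holds out 25% of the rows, so the split and the scores that follow change from run to run. Below is a minimal sketch on random stand-in features, assuming you want a reproducible split that preserves the class ratio; `random_state=42` is an arbitrary choice and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(2000, 768))     # stand-in for the [CLS] vectors
toy_labels = np.array([1] * 1041 + [0] * 959)   # same class counts as batch_1

X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels,
    test_size=0.25,        # the default, written out explicitly
    random_state=42,       # fixes the shuffle so reruns match
    stratify=toy_labels,   # keeps the class ratio equal in both parts
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```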

We now train the LogisticRegression model. If you have run a grid search over the regularization parameter `C`, you can plug the best value into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [35]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
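The grid search mentioned above is not shown in this notebook. Here is a minimal sketch of how it could look with scikit-learn's `GridSearchCV`, run on synthetic stand-in data so that it is self-contained. The grid of `C` values and the `toy_` names are assumptions; with the real data you would pass `train_features` and `train_labels` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_features, train_labels).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 20))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate values for the inverse regularization strength C (assumed grid).
param_grid = {"C": np.linspace(0.0001, 100, 20)}

# 5-fold cross-validation over every candidate value of C.
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid_search.fit(toy_X, toy_y)

print("best parameters:", grid_search.best_params_)
print("best CV accuracy:", grid_search.best_score_)
```

The value in `grid_search.best_params_["C"]` is what you would then pass to `LogisticRegression(C=...)`.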

## Evaluating the Result
Accuracy on the test set:

In [36]:
lr_clf.score(test_features, test_labels)

0.832

In [38]:
from sklearn.metrics import classification_report

In [39]:
pred = lr_clf.predict(test_features)
print(classification_report(test_labels, pred))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       248
           1       0.85      0.81      0.83       252

    accuracy                           0.83       500
   macro avg       0.83      0.83      0.83       500
weighted avg       0.83      0.83      0.83       500
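For readers new to these columns, the sketch below computes precision, recall and F1 for one class by hand. The labels are invented for illustration and are unrelated to the results above; the helper `prf` is hypothetical.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])


def prf(y_true, y_pred, cls):
    """Precision, recall and F1 for the class `cls`."""
    tp = np.sum((y_pred == cls) & (y_true == cls))  # predicted cls, was cls
    fp = np.sum((y_pred == cls) & (y_true != cls))  # predicted cls, was not
    fn = np.sum((y_pred != cls) & (y_true == cls))  # missed cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = prf(y_true, y_pred, cls=1)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision asks how many of the predicted positives were right, recall asks how many of the true positives were found, and F1 is their harmonic mean.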



Let's compare with a DummyClassifier baseline.

In [41]:
# A classifier that assigns labels at random while respecting the class distribution;
# it serves as a baseline for comparison with random guessing.
from sklearn.dummy import DummyClassifier
# 'stratified' was the default strategy in older scikit-learn releases; newer ones
# default to 'prior', so we set it explicitly to keep the behaviour described above.
clf = DummyClassifier(strategy='stratified')

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Since the dataset is balanced and the task is binary, the score is around 50%.

Dummy classifier score: 0.495 (+/- 0.06)
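The roughly 50% figure can be checked with a little arithmetic. A classifier that guesses class 1 with probability p (the share of class 1) and class 0 otherwise is correct with probability p*p + (1 - p)*(1 - p). The sketch below plugs in the class counts of `batch_1` printed earlier. It assumes the `stratified` guessing strategy and uses the counts of all 2,000 sentences rather than the training split, so it is an approximation.

```python
# Class counts of batch_1 (from the value_counts output).
n_pos, n_neg = 1041, 959
p = n_pos / (n_pos + n_neg)  # share of class 1

# Guess 1 with probability p and 0 with probability 1 - p;
# the guess is correct when it coincides with the true label.
expected_accuracy = p * p + (1 - p) * (1 - p)

print(round(expected_accuracy, 3))  # 0.501
```

This is in line with the cross-validated dummy score once its spread is taken into account.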


What about better classifiers?

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task, a process called **fine-tuning**, which updates the model's weights so that it performs better on this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT achieves an accuracy score of **90.7**. The full-size BERT model achieves **94.9**.



And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from distilBERT to BERT and see how that works.