<a href="https://colab.research.google.com/github/tehilamal/GeekWeek2019/blob/master/lecture_5/Text_Mining_2024_cex5_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CE5: BERT and Co.

## 0. Dataset and Baselines

In [1]:
# basic imports
import numpy as np
import pandas as pd
import warnings
# warnings.filterwarnings('ignore')

### 0.1. Dataset

The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

**SST2 on HuggingFace:**

https://huggingface.co/datasets/stanfordnlp/sst2

**More on SST2:**

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

Binary classification experiments on full sentences (negative or somewhat negative vs somewhat positive or positive with neutral sentences discarded) refer to the dataset as SST-2 or SST binary.

#### 0.1.1. Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [2]:
dataset_url = 'https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'

In [3]:
df = pd.read_csv(dataset_url, delimiter='\t', header=None)

In [4]:
sentences = df[0]
labels = df[1]

In [7]:
sentences.head(), labels.head()

(0    a stirring , funny and finally transporting re...
 1    apparently reassembled from the cutting room f...
 2    they presume their audience wo n't sit still f...
 3    this is a visually stunning rumination on love...
 4    jonathan parker 's bartleby should have been t...
 Name: 0, dtype: object,
 0    1
 1    0
 2    0
 3    1
 4    1
 Name: 1, dtype: int64)

For performance reasons, we'll only use 2,000 sentences from the dataset when using BERT.

In [8]:
subdf = df[:2000]

In [9]:
sub_sentences = subdf[0]
sub_labels = subdf[1]

Let's see how many sentences are labeled as having a positive sentiment (value 1) and how many are labeled with negative sentiment (value 0):

In [10]:
subdf[1].value_counts()

1
1    1041
0     959
Name: count, dtype: int64


#### 0.1.2. Train/Test Split

In [11]:
from sklearn.model_selection import train_test_split

##### Q1: Split the sentences and labels into train and test sets

Then do the same for `sub_sentences` and `sub_labels`.

**Hint:** Look at `sklearn`'s `train_test_split`.

In [None]:
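# Split the full dataset (used by the baselines in 0.2)
train_sentences, test_sentences, train_labels, test_labels = train_test_split(sentences, labels, test_size=0.2, random_state=42)

# Split the 2,000-sentence subset (used in the BERT sections)
sub_train_sentences, sub_test_sentences, sub_train_labels, sub_test_labels = train_test_split(sub_sentences, sub_labels, test_size=0.2, random_state=42)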


### 0.2. Baselines

#### 0.2.0. The Dummy Classifier

DummyClassifier makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more complex classifiers.

https://scikit-learn.org/1.5/modules/generated/sklearn.dummy.DummyClassifier.html
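To make the idea concrete, here is a minimal pure-Python sketch of a majority-class baseline, roughly in the spirit of `DummyClassifier(strategy="most_frequent")`. The helper name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not part of sklearn.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test example."""
    # The features are never looked at: only the training label counts matter.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(test_labels)
    correct = sum(pred == gold for pred, gold in zip(predictions, test_labels))
    return majority_label, correct / len(test_labels)

# Toy labels (hypothetical): 1 = positive, 0 = negative
toy_train_labels = [1, 1, 1, 0, 0]
toy_test_labels = [1, 0, 1, 0]
majority_label, accuracy = majority_baseline_accuracy(toy_train_labels, toy_test_labels)
print(majority_label, accuracy)
```

On a roughly balanced dataset such as our 2,000-sentence subset (1,041 positive vs. 959 negative), a baseline like this should land near 50% accuracy, which is the number any real classifier needs to beat.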

##### Q2: Use sklearn's dummy classifier as a baseline

---



In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Assumes the Q1 split of the 2,000-sentence subset is available as
# sub_train_sentences, sub_test_sentences, sub_train_labels, sub_test_labels.

# Always predict the most frequent label seen in training
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(sub_train_sentences, sub_train_labels)  # the features are ignored
dummy_predictions = dummy_clf.predict(sub_test_sentences)
dummy_accuracy = accuracy_score(sub_test_labels, dummy_predictions)
print(f"Dummy classifier accuracy: {dummy_accuracy:.4f}")

#### 0.2.1. LogReg w/ BoW

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [13]:
# CountVectorizer for Bag-of-Words model
vectorizer = CountVectorizer(max_features=300)

##### Q3: Complete the BoW-based baseline

What accuracy does it achieve on the test set?
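As a reminder of what a bag-of-words representation with a capped vocabulary looks like, here is a simplified pure-Python sketch. It is not sklearn's implementation: the helpers `build_vocabulary` and `to_count_vector` are hypothetical, and the whitespace tokenization differs from `CountVectorizer`'s default tokenizer (which, for example, drops single-character tokens such as "a").

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the `max_features` most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic
    most_common = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept_tokens = sorted(token for token, _ in most_common[:max_features])
    return {token: index for index, token in enumerate(kept_tokens)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary token occurs in `document`."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

# Toy documents (hypothetical), for illustration only
toy_documents = ["a funny funny film", "a dull film", "funny and stirring"]
toy_vocabulary = build_vocabulary(toy_documents, max_features=3)
toy_vectors = [to_count_vector(doc, toy_vocabulary) for doc in toy_documents]
print(toy_vocabulary)
print(toy_vectors)
```

Each document becomes a fixed-length vector of counts, which is the kind of input a logistic regression model can be trained on.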

#### 0.2.2. LogReg w/ Word Embeddings (GloVe)

In [14]:
# !pip install gensim

In [15]:
import gensim.downloader

In [None]:
glove_embedder = gensim.downloader.load('glove-twitter-200')



In [None]:
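import numpy as np

def document_to_vector(document, model):
    """Average the embeddings of all in-vocabulary tokens in `document`."""
    token_list = gensim.utils.simple_preprocess(document)
    vector_list = []
    for token in token_list:
        try:
            vector_list.append(model[token])
        except KeyError:
            # Skip out-of-vocabulary tokens
            pass
    if not vector_list:
        # No known tokens: return a zero vector so every document has the same shape
        return np.zeros(model.vector_size)
    return np.average(vector_list, axis=0)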
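
The averaging idea behind `document_to_vector` can be illustrated without downloading an embedding model. The sketch below uses a hypothetical three-dimensional toy embedding dictionary; the helper `average_embedding` is an assumption for illustration, and falling back to a zero vector when no token is known is just one possible choice.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a real embedding model
toy_embeddings = {
    "funny": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}
TOY_DIM = 3

def average_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known_vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not known_vectors:
        return np.zeros(dim)
    return np.mean(known_vectors, axis=0)

# "unseenword" is not in the toy vocabulary, so it is skipped
doc_vector = average_embedding(["funny", "unseenword", "film"], toy_embeddings, TOY_DIM)
empty_vector = average_embedding(["unseenword"], toy_embeddings, TOY_DIM)
print(doc_vector)
print(empty_vector)
```

Because every document ends up with a vector of the same length, the vectors can be stacked into a single matrix and passed to a classifier.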

In [None]:
from scipy.sparse import csr_matrix

def vector_list_to_sparse_matrix(vector_list):
    dense_matrix = np.stack(vector_list)
    return csr_matrix(dense_matrix)

In [None]:
def sentence_list_to_sparse_embeddings(sentence_list, embedding_model):
    vector_list = [document_to_vector(sent, embedding_model) for sent in sentence_list]
    return vector_list_to_sparse_matrix(vector_list)

In [None]:
# Transform text data
train_glove_vectors = sentence_list_to_sparse_embeddings(train_sentences, glove_embedder)
test_glove_vectors = sentence_list_to_sparse_embeddings(test_sentences, glove_embedder)

In [None]:
lr_glove_clf = LogisticRegression()
lr_glove_clf.fit(train_glove_vectors, train_labels)

In [None]:
lr_glove_clf.score(test_glove_vectors, test_labels)

## 1. (Old School) DistilBERT for Sentence Classification

Let's use a pretrained BERT(-like) model for sentiment classification.

We will use DistilBERT, a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.

First, we will extract document vectors from it the way the BERT paper suggests for text classification tasks.

### 1.1. Installing the transformers library

Let's start by installing the huggingface transformers library so we can load pre-trained transformer models.

In [None]:
!pip install transformers

### 1.2. Library imports

In [None]:
# On Colab, run:
# %tensorflow_version 1.x

In [None]:
# sklearn
from sklearn.linear_model import LogisticRegression

# transformers and related packages
import transformers
import torch  # pytorch, perhaps THE deep learning framework in Python (but see also Keras and TensorFlow)

### 1.3. Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model.

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertModel, transformers.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 'bert-base-uncased')

# Load pretrained model and tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
embedder = model_class.from_pretrained(pretrained_weights)

### 1.4. Preprocessing
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

In [None]:
def bert_preprocess_sentence_list(sent_list):
    # 1 - Tokenization
    tokenized = sent_list.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
    # 2 - Padding
    # After tokenization, `tokenized` is a Series of sentences, each represented
    # as a list of token IDs. We want BERT to process our examples all at once (as one batch).
    # It's just faster that way.
    # For that reason, we need to pad all lists to the same length, so we can represent the
    # input as one 2-d array,
    # rather than a list of lists (of different lengths).
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
    # print("Shape after padding:")
    # print(np.array(padded).shape)
    # 3 - Masking
    # If we directly send `padded` to BERT, that would slightly confuse it.
    # We need to create another variable to tell it to ignore (mask) the padding
    # we've added when it's processing its input. That's what attention_mask is:
    attention_mask = np.where(padded != 0, 1, 0)
    # attention_mask.shape
    return padded, attention_mask

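The function returns the padded token IDs as `padded`, together with the matching `attention_mask`; `padded.shape` gives the number of sentences by the length of the longest tokenized sentence.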
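The padding and masking steps can be tried on a tiny example without loading a tokenizer. This is a minimal sketch using hypothetical token-ID lists; it assumes, as the function above does, that ID 0 is reserved for padding.

```python
import numpy as np

# Hypothetical token-ID lists of different lengths (0 is reserved for padding)
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]

max_len = max(len(ids) for ids in toy_tokenized)
# Right-pad every list with zeros up to the longest length
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])
# 1 marks a real token, 0 marks padding
toy_attention_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded.shape)
print(toy_attention_mask)
```

The shorter sentence gets two padding positions, and the mask marks exactly those positions with 0 so the model can ignore them.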

### 1.5. Embedding our sentences


The `embedder()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

As suggested in the original paper, we will slice out only the part of the output that we need: the output corresponding to the first token of each sentence. BERT does sentence classification by adding a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [None]:
def preprocessed_vectors_to_embeddings(padded, attention_mask):
    input_ids = torch.tensor(padded)
    attention_mask = torch.tensor(attention_mask)
    with torch.no_grad():
        last_hidden_states = embedder(input_ids, attention_mask=attention_mask)
    # taking only the [CLS] vector
    features = last_hidden_states[0][:,0,:].numpy()
    return features
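
The `[:, 0, :]` slice can be illustrated on a small stand-in array. This sketch assumes the hidden states have the shape (number of sentences, tokens per sentence, hidden size); the sizes and values below are made up for illustration.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 4, 3
# Stand-in for the model output: one hidden vector per token per sentence
toy_hidden_states = np.arange(batch_size * seq_len * hidden_size, dtype=float).reshape(
    batch_size, seq_len, hidden_size
)

# All sentences, first token position only, all hidden dimensions
toy_cls_features = toy_hidden_states[:, 0, :]

print(toy_cls_features.shape)
print(toy_cls_features)
```

The result has one row per sentence and one column per hidden dimension, which is the 2-d layout a logistic regression model expects.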

##### Q4: Define the sentence_list_to_bert_cls_embeddings function

**Hint**: Use the two functions pre-defined above.

##### Q5: Get the DistilBERT embeddings for our case

Populate `distilbert_train_vectors` and `distilbert_test_vectors`.

### 1.6. Train and evaluate an LR model

In [None]:
bert_lr_clf = LogisticRegression()
bert_lr_clf.fit(distilbert_train_vectors, sub_train_labels)

In [None]:
bert_lr_clf.score(distilbert_test_vectors, sub_test_labels)

**What would we usually do to improve performance?**

1. More sophisticated classification models (if you have enough data, a deep learning model).
2. Different embedding models.
3. If we have a large enough labeled dataset, [fine-tuning](https://huggingface.co/transformers/examples.html#glue).

## 2. Modern Usage of the Transformers package

This example will demonstrate two things:
1. The modern, streamlined way to use the `transformers` package.
2. Using a model pretrained for a specific case (rather than training a new model on pre-trained word/document embeddings).

### 2.1. Library Imports

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax
from sklearn.metrics import accuracy_score  # used for evaluation in 2.5

### 2.2. Loading the pretrained tokenizer, model and configuration

We will use a highly popular (on HuggingFace) model for sentiment analysis:

Twitter-roBERTa-base for Sentiment Analysis

It was finetuned to do well for sentiment analysis of tweets (with the TweetEval benchmark in mind).
Let's see how it fares on movie sentiment!

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

In [None]:
# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

### 2.3. Usage example for a single text

In [None]:
# example on a single text
text = "Covid cases are increasing fast!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
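
The softmax and ranking steps can be tried in isolation on made-up numbers. This is a minimal sketch: `stable_softmax` is a hypothetical helper standing in for `scipy.special.softmax`, and the logits and label mapping are illustrative assumptions rather than real model output.

```python
import numpy as np

def stable_softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical logits and label mapping, for illustration only
toy_logits = np.array([0.5, 2.0, -1.0])
toy_id2label = {0: "negative", 1: "neutral", 2: "positive"}

toy_scores = stable_softmax(toy_logits)
toy_ranking = np.argsort(toy_scores)[::-1]  # indices from highest to lowest score
ranked_labels = [toy_id2label[int(i)] for i in toy_ranking]
print(ranked_labels)
```

Softmax preserves the order of the logits, so the label with the largest raw score is also ranked first after normalization.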

### 2.4. A nice prediction function

##### Q6: Write get_roberta_sentiment_pred

Define the `get_roberta_sentiment_pred` function, which gets a string (some text) and returns 1 if our model detects a positive sentiment, and 0 otherwise.

**Hint 1:** Use the code above as a basis.

**Hint 2:** There are several possible ways to derive a binary (0/1) sentiment score from the three-class scores above. Go wild!
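One possible mapping, sketched here under assumptions, compares the positive and negative scores and ignores the neutral one. The helper `scores_to_binary` is hypothetical, and the sketch assumes the label mapping contains the names `negative`, `neutral` and `positive`; check this against `config.id2label` before relying on it.

```python
def scores_to_binary(scores, id2label):
    """Return 1 if the positive score beats the negative score, else 0."""
    label_to_score = {id2label[i].lower(): float(score) for i, score in enumerate(scores)}
    # The neutral score is deliberately ignored in this variant
    return 1 if label_to_score["positive"] > label_to_score["negative"] else 0

# Hypothetical label mapping and scores, for illustration only
toy_id2label = {0: "negative", 1: "neutral", 2: "positive"}
pred_leaning_positive = scores_to_binary([0.10, 0.60, 0.30], toy_id2label)
pred_leaning_negative = scores_to_binary([0.50, 0.40, 0.10], toy_id2label)
print(pred_leaning_positive, pred_leaning_negative)
```

Since SST2 has no neutral class, this variant forces a decision even when neutral is the top-scoring label. An alternative would be to take the top-ranked label and treat anything other than positive as 0.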

In [None]:
get_roberta_sentiment_pred("Covid cases are increasing fast!")

In [None]:
get_roberta_sentiment_pred("This is so great!")

### 2.5. Predict and Evaluate on our test set

*(not an efficient way to do this, but I won't teach you the matricized way)*

In [None]:
sub_test_preds = [get_roberta_sentiment_pred(sent) for sent in sub_test_sentences]

In [None]:
# Calculate Accuracy
accuracy = accuracy_score(sub_test_labels, sub_test_preds)
print(f"Accuracy: {accuracy:.4f}")

## 3. SST2 scores of other models

https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary

For reference, the highest accuracy score for this dataset in the above benchmark is currently **97.5**, by T5-11B.

However, that result is fairly old, and I assume there are better models out there: Papers With Code benchmarks only include scores published in academic papers.

DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task. The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.