# Portfolio-Exam Part I - Sentiment Analysis

* Social Media Analytics - MADS-SMA
* Valentin Werger

#### Addition to the first notebook: Evaluating different transformer models

When running in Google Colab, uncomment the following two lines.

In [1]:
#!pip install transformers &> /dev/null

In [2]:
#!pip install datasets &> /dev/null

In [3]:
import pandas as pd
import torch
import datasets
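# Note: load_metric is deprecated in newer datasets releases; the separate evaluate library (evaluate.load) replaces it
from datasets import load_metric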
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig, get_scheduler
from torch.utils.data import DataLoader
from torch.optim import AdamW
import subprocess

Now we want to compare the previous performances to those of finetuned transformer models on the same Yelp reviews. In theory these models should be better at predicting the sentiment, since they have learned a contextualized understanding of text during pretraining and can apply it to the new task. In contrast, the previous machine learning algorithms are only given static word counts or vectors, which fail to capture words with multiple meanings or meanings that depend on word order.

In [4]:
yelp = pd.read_csv("yelp_reviews_hamburg_en.csv", parse_dates=["date"], dtype={"stars":"int64"})
# Subtract one from stars because some models (XGBoost, Transformers) expect labels to start from 0
yelp["stars"] = yelp.stars - 1

For easier handling, the data is transformed into a Hugging Face dataset. While doing that, a test set of 20% is split off so the performance of the finetuned model can be estimated later. This is less robust than the previous approach with cross validation, but will give an idea of whether the transformers are actually stronger.

In [5]:
# Transform data into a Hugging Face dataset
yelp["review_length"] = [len(review) for review in yelp["text"]]
yelp_huggingface = datasets.Dataset.from_pandas(yelp.drop(columns = ["url", "date"])).train_test_split(test_size=0.2, seed=10)
yelp_huggingface = yelp_huggingface.sort("review_length", reverse = True)

yelp_huggingface

DatasetDict({
    train: Dataset({
        features: ['stars', 'text', 'review_length'],
        num_rows: 2420
    })
    test: Dataset({
        features: ['stars', 'text', 'review_length'],
        num_rows: 605
    })
})

In [6]:
# Test how often token amount would exceed the limit
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    result = tokenizer(
        examples["text"])
    
    return result


tokenized_datasets = yelp_huggingface.map(tokenize_function, batched=True)

# How often would the number of tokens exceed BERT's maximum input length?
print(pd.Series([len(x) > 512 for x in tokenized_datasets["train"]["input_ids"]]).value_counts())
print(pd.Series([len(x) > 512 for x in tokenized_datasets["test"]["input_ids"]]).value_counts())

  0%|          | 0/3 [00:00<?, ?ba/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1105 > 512). Running this sequence through the model will result in indexing errors


  0%|          | 0/1 [00:00<?, ?ba/s]

False    2400
True       20
dtype: int64
False    601
True       4
dtype: int64


There are only a few instances where tokenization produces an input longer than the maximum allowed length of 512 tokens, so it would not be a big problem to simply truncate those instances. For the sake of experience, I nevertheless accommodate them in the following function. This is done by specifying return_overflowing_tokens=True, which returns the tokens beyond max_length as additional rows, so the model input contains slightly more rows than there are reviews. A stride of 100 means that consecutive chunks of the same review overlap by 100 tokens, so that important context is not lost by splitting the text. The sample map (overflow_to_sample_mapping) makes it possible to copy the original columns, such as the star label, to every chunk. Each chunk is then treated as a separate example, during training as well as evaluation. Finally, a PyTorch DataLoader is created from the tokenized data.
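
The chunking with overlap described above can be illustrated without a tokenizer. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation: the helper names `chunk_with_stride` and `chunk_batch` are hypothetical, special tokens and padding are ignored, and the small `max_length` and `stride` values are chosen only to keep the example readable.

```python
def chunk_with_stride(token_ids, max_length, stride):
    """Split one token sequence into windows that overlap by `stride` tokens."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride  # how far the window moves each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end of the sequence
        start += step
    return chunks


def chunk_batch(batch, max_length, stride):
    """Chunk every sequence and remember which sample each chunk came from."""
    all_chunks, sample_map = [], []
    for sample_index, token_ids in enumerate(batch):
        for chunk in chunk_with_stride(token_ids, max_length, stride):
            all_chunks.append(chunk)
            sample_map.append(sample_index)
    return all_chunks, sample_map


# Toy batch: one short "review" (5 tokens) and one long "review" (12 tokens)
batch = [list(range(5)), list(range(12))]
labels = [4, 0]

chunks, sample_map = chunk_batch(batch, max_length=8, stride=3)
# Copy the label of the original sample to each of its chunks
chunk_labels = [labels[i] for i in sample_map]

for chunk, label in zip(chunks, chunk_labels):
    print(label, chunk)
```

The long sequence yields two chunks that share three tokens, and both chunks carry the label of the review they came from, which mirrors how the sample map is used in `prepare_data`.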

An alternative for very long documents, which is not explored here, would be to summarize them first before using them as input for the transformer.

In [7]:
def prepare_data(data, pretrained_model, batch_size = 8):

  tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

  def tokenize_function(examples):
    result = tokenizer(
        examples["text"], 
        max_length=512, 
        padding="max_length", 
        truncation=True,
        return_overflowing_tokens=True, 
        stride=100)
    
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result


  tokenized_datasets = data.map(tokenize_function, batched=True)

  tokenized_datasets = tokenized_datasets.remove_columns(["text", "review_length"])
  tokenized_datasets = tokenized_datasets.rename_column("stars", "labels")
  tokenized_datasets.set_format("torch")

  train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=batch_size)
  eval_dataloader = DataLoader(tokenized_datasets["test"], batch_size=batch_size)

  return train_dataloader, eval_dataloader

The chosen model is trained with AdamW as optimizer. The learning rate starts at a value that the user can set (5e-5 by default) and is decayed linearly to zero over the training steps, without warmup. If a GPU is available (for example in a Colab notebook with a GPU runtime or on a local machine with a CUDA-enabled torch installation), the training loop runs on that device, which speeds up training considerably.
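
Note that the `linear` schedule passed to `get_scheduler` in `train_model` changes the learning rate during training. The sketch below shows the shape of such a schedule in plain Python. The function name `linear_schedule_lr` is hypothetical, and the formula is an illustration of linear warmup followed by linear decay, not a copy of the library code.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 towards base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: go linearly from base_lr down to 0 at the last step
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


# 918 is the total number of steps shown in the progress bar of the bert-base-uncased run
total_steps = 918
for step in (0, 459, 918):
    lr = linear_schedule_lr(step, base_lr=5e-5, num_training_steps=total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With `num_warmup_steps=0`, as in this notebook, training starts at the full learning rate and the last updates are made with a learning rate close to zero.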

In [8]:
def train_model(model, train_dataloader, lr=5e-5, epochs=10):

  optimizer = AdamW(model.parameters(), lr=lr)

  num_training_steps = epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
  )

  device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
  print(device)
  model.to(device)

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(epochs):
      for batch in train_dataloader:
          batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
          loss.backward()

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)

  return model

The finetuned model can be evaluated with the eval_dataloader and a list of specified metrics.

In [9]:
def evaluate_model(model, eval_dataloader, metrics, print_metric_info=False):

  device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

  metric_list = []
  for metric_config in metrics:
    metric_dict = {}
    metric_dict["metric"] = load_metric(metric_config["name"])
    metric_dict["additional_arguments"] = metric_config["additional_arguments"]
    if print_metric_info:
      print(metric_dict["metric"].inputs_description)

    metric_list.append(metric_dict)

  model.eval()
  for batch in eval_dataloader:
      batch = {k: v.to(device) for k, v in batch.items()}
      with torch.no_grad():
          outputs = model(**batch)

      logits = outputs.logits
      predictions = torch.argmax(logits, dim=-1)

      for metric in metric_list:
        metric["metric"].add_batch(predictions=predictions, references=batch["labels"])

  def merge_dicts(dict1, dict2):
    return {**dict1, **dict2}
  metrics_results = {}

  for metric in metric_list:
    metrics_results = merge_dicts(metrics_results, metric["metric"].compute(**metric["additional_arguments"]))

  return metrics_results

In [10]:
def show_gpu(msg):
    """
    ref: https://discuss.pytorch.org/t/access-gpu-memory-usage-in-pytorch/3192/4
    """
    def query(field):
        return(subprocess.check_output(
            ['nvidia-smi', f'--query-gpu={field}',
                '--format=csv,nounits,noheader'], 
            encoding='utf-8'))
    def to_int(result):
        return int(result.strip().split('\n')[0])
    
    used = to_int(query('memory.used'))
    total = to_int(query('memory.total'))
    pct = used/total
    print('\n' + msg, f'{100*pct:2.1f}% ({used} out of {total})') 

Three pretrained transformer models from Hugging Face are tested in the following section:

* bert-base-uncased
  * pretrained on a large corpus of English data in a self-supervised fashion
  * trained through masked language modeling (MLM) and next sentence prediction (NSP)
* roberta-base
  * pretrained on English Wikipedia and a large corpus of English books, among other sources
  * pretrained in a self-supervised fashion through masked language modeling (MLM) only
* distilbert-base-uncased-finetuned-sst-2-english
  * version of DistilBERT finetuned on SST-2 movie review data

In [11]:
show_gpu("GPU usage")


GPU usage 32.0% (2625 out of 8192)


### bert-base-uncased

In [12]:
train_dataloader, eval_dataloader = prepare_data(yelp_huggingface, "bert-base-uncased")

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [13]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)
finetuned_model = train_model(model, train_dataloader, epochs = 3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

cuda


  0%|          | 0/918 [00:00<?, ?it/s]

In [14]:
evaluate_model(finetuned_model, 
               eval_dataloader, 
               [
                {"name":"accuracy", "additional_arguments":{}}, 
                {"name":"f1", "additional_arguments": {"average":"macro"}},
                {"name": "precision", "additional_arguments": {"average": "macro"}},
                {"name": "recall", "additional_arguments": {"average": "macro"}}
                ], 
               print_metric_info=False)

{'accuracy': 0.6617405582922824,
 'f1': 0.5855779921189521,
 'precision': 0.6110926669001083,
 'recall': 0.586010698537382}

In [15]:
del finetuned_model
torch.cuda.empty_cache()
show_gpu("GPU usage")


GPU usage 22.8% (1864 out of 8192)


### roberta-base

In [16]:
train_dataloader, eval_dataloader = prepare_data(yelp_huggingface, "roberta-base")

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [17]:
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)
finetuned_model = train_model(model, train_dataloader, epochs = 3)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

cuda


  0%|          | 0/915 [00:00<?, ?it/s]

In [18]:
evaluate_model(finetuned_model, 
               eval_dataloader, 
               [
                {"name":"accuracy", "additional_arguments":{}}, 
                {"name":"f1", "additional_arguments": {"average":"macro"}},
                {"name": "precision", "additional_arguments": {"average": "macro"}},
                {"name": "recall", "additional_arguments": {"average": "macro"}}
                ], 
               print_metric_info=False)

{'accuracy': 0.639344262295082,
 'f1': 0.5084989366349728,
 'precision': 0.6038327562186113,
 'recall': 0.5351313328595577}

In [19]:
del finetuned_model
torch.cuda.empty_cache()
show_gpu("GPU usage")


GPU usage 23.8% (1947 out of 8192)


### distilbert-base-uncased-finetuned-sst-2-english

In [20]:
train_dataloader, eval_dataloader = prepare_data(yelp_huggingface, "distilbert-base-uncased-finetuned-sst-2-english")

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [21]:
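config = AutoConfig.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
config.num_labels = 5
# Note: from_config builds the architecture with randomly initialized weights; it does not load
# the pretrained checkpoint. Reusing the SST-2 weights with a new 5-label head would instead need
# from_pretrained(..., num_labels=5, ignore_mismatched_sizes=True).
model = AutoModelForSequenceClassification.from_config(config)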

finetuned_model = train_model(model, train_dataloader, epochs = 5)

cuda


  0%|          | 0/1530 [00:00<?, ?it/s]

In [22]:
evaluate_model(finetuned_model, 
               eval_dataloader, 
               [
                {"name":"accuracy", "additional_arguments":{}}, 
                {"name":"f1", "additional_arguments": {"average":"macro"}},
                {"name": "precision", "additional_arguments": {"average": "macro"}},
                {"name": "recall", "additional_arguments": {"average": "macro"}}
                ], 
               print_metric_info=False)

  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.5467980295566502,
 'f1': 0.35070834672395523,
 'precision': 0.36899411449973246,
 'recall': 0.35379784305685036}

In [23]:
del finetuned_model
torch.cuda.empty_cache()
show_gpu("GPU usage")


GPU usage 27.4% (2244 out of 8192)


### Comparison

bert-base-uncased achieves the best accuracy with about 0.66, compared to about 0.64 for roberta-base. The more relevant metric, however, is the F1 score. It is the harmonic mean of precision and recall, so a model is punished for being especially bad in one of them, and the score thus represents the tradeoff between precision and recall that a model faces. In a multiclass case like this one, there are several ways to average the F1 scores that are computed for every class. Here we use macro-averaging, where a simple mean is taken that weights every class the same. This gives the F1 scores of minority classes like 1 or 2 stars more importance than their share of the data would.

The macro F1 scores are somewhat lower than the accuracies, but not dramatically so, which indicates that the finetuned models do not only classify the frequent good reviews correctly but are also able to classify bad reviews to a reasonable degree. In this metric roberta-base is clearly lower than bert-base-uncased, with about 0.51 versus about 0.59.
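
To make the difference between accuracy and macro-averaged F1 concrete, here is a small worked example in plain Python. It is a sketch with invented toy labels, not data from the Yelp reviews, and the helper names are hypothetical. The per-class F1 is computed as 2·TP / (2·TP + FP + FN), which is equivalent to the harmonic mean of precision and recall.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    # Guard against division by zero if the label occurs in neither list
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: 2 one-star reviews (label 0) and 8 five-star reviews (label 4).
# The "model" always predicts the majority class.
y_true = [0, 0, 4, 4, 4, 4, 4, 4, 4, 4]
y_pred = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

A model that ignores the minority class still reaches an accuracy of 0.8 here, while its macro F1 drops to roughly 0.44 because the F1 of the minority class is 0 and counts as much as that of the majority class.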

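The last model is an already finetuned model based on DistilBERT that was trained to predict binary sentiment from English movie reviews. The idea was to replace the classification head so that it outputs 5 labels and to further finetune this model on our Yelp review data. The premise is that a model that is not just trained on general language but already specialised on reviews might perform better. In practice, however, the scores are well below those of the two base transformers: accuracy is about 0.55 and the macro F1 score only about 0.35. The F1 score is much lower than both the accuracy and the F1 scores of the first two models, which can be due to worse general performance, a bigger gap between precision and recall, and unequal performance across classes that favors the majority classes. One caveat applies to this comparison: the model is created with AutoModelForSequenceClassification.from_config, which only builds the architecture with randomly initialized weights and does not load the pretrained checkpoint. The result therefore most likely reflects a DistilBERT architecture trained from scratch for 5 epochs on the small Yelp dataset rather than a further-finetuned SST-2 model.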

### Summary

In general, as expected, the finetuned base transformer models are clearly better than the previously tried machine learning approaches with word counts or embeddings as features. They reach macro F1 scores of up to almost 0.6 (about 0.59 for bert-base-uncased and 0.51 for roberta-base) against the 0.42 of Logistic Regression, which was the best model in the previous experiment.

However, increasing the number of epochs beyond 3 did not seem to noticeably improve test performance for the two base transformers, so one would need to find other ways to increase the performance on the Yelp reviews even further.

As a last point, one has to note that the transformer models were only tested on a single test split. The choice of that split has a strong impact on the evaluation of the model, so the numbers achieved in this notebook are a lot less robust than those obtained via cross validation for the machine learning models.
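
One way to reduce the dependence on a single split would be to repeat the finetuning and evaluation over several folds. The following is a minimal sketch in plain Python of how such fold indices could be generated. It was not run as part of this notebook, and the function name `k_fold_indices` as well as the choice of 5 folds are assumptions for illustration. Each pair of index lists could then be used to select rows for training and evaluation, at the cost of finetuning the model once per fold.

```python
import random


def k_fold_indices(n_samples, k, seed=10):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    # Distribute the shuffled indices round-robin over k folds
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test_idx = folds[test_fold]
        train_idx = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train_idx, test_idx


# 3025 is the total number of reviews (2420 train + 605 test in the split above)
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(3025, k=5)):
    print(f"fold {fold_number}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```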