# NLP LAB 3 - HuggingFace Transformers

Authors:
* Ethan MACHAVOINE
* Jonathan POELGER

In [1]:
import evaluate
from datasets import load_dataset
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer



In [17]:
def load_data(checkpoint):
    """
    Load the IMDB dataset, create a matching tokenizer and tokenize the dataset.

    input:
        - String: The name of the checkpoint to use for the tokenizer

    output:
        - Object: The tokenized dataset
        - Object: The tokenizer
        - String: The checkpoint name
    """
    raw_datasets = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Truncate long reviews to the model's maximum length; padding is applied per batch by the data collator
    tokenized_datasets = raw_datasets.map(lambda example: tokenizer(example["text"], truncation=True), batched=True)
    return tokenized_datasets, tokenizer, checkpoint

def create_model(tokenized_datasets, tokenizer, checkpoint, num_train_epochs=1):
    """
    Create a Trainer for a binary classification model using the outputs of load_data.

    input:
        - Object: A tokenized dataset
        - Object: A tokenizer
        - String: The checkpoint name
        - Int: Number of epochs to train

    output:
        - Object: A trainer for the model
        - Object: The tokenized dataset
    """
    training_args = TrainingArguments("test-trainer", num_train_epochs=num_train_epochs)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        tokenizer=tokenizer,
    )
    return trainer, tokenized_datasets

In [11]:
def fine_tune():
    checkpoint = "distilbert-base-uncased"

    trainer, tokenized_datasets = create_model(*load_data(checkpoint))
    trainer.train()

    predictions = trainer.predict(tokenized_datasets["test"])
    print(predictions.predictions.shape, predictions.label_ids.shape)
    return predictions

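def compute_metrics(eval_preds):
    # Load the accuracy metric from the evaluate library
    metric = evaluate.load("accuracy")
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Very long to run:
# predictions = fine_tune()
# print(compute_metrics((predictions.predictions, predictions.label_ids)))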

In [18]:
mvw_checkpoint = "mvonwyl/distilbert-base-uncased-imdb"

trainer, tokenized_datasets = create_model(*load_data(mvw_checkpoint))
predictions = trainer.predict(tokenized_datasets["test"])
predicted_labels = np.argmax(predictions.predictions, axis=1)

# Convert the label column to an array so the comparison below is element-wise
actual_labels = np.array(tokenized_datasets["test"]["label"])
accuracy = np.mean(predicted_labels == actual_labels) * 100

print("Accuracy:", accuracy)

Found cached dataset imdb (C:/Users/jo81/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:01<00:00,  2.31it/s]
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 10%|▉         | 305/3125 [32:01<5:37:57,  7.19s/it]

KeyboardInterrupt: 

In [None]:
# Collect every misclassified review as [text, predicted label, actual label]
X = tokenized_datasets["test"]["text"]
wrong = [[X[i], predicted_labels[i], actual_labels[i]] for i in range(len(X)) if predicted_labels[i] != actual_labels[i]]

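The misclassified reviews can also be separated by error direction, which is how the examples in the next section are grouped. The following is only a minimal sketch on toy data: the helper `split_errors` and the sample values are hypothetical and not part of the lab, and it assumes the IMDB label convention (0 = negative, 1 = positive).

```python
import numpy as np

# Toy stand-ins for the notebook's variables (hypothetical values).
texts = ["great film", "awful plot", "fun but flawed", "dull"]
predicted_labels = np.array([1, 0, 1, 1])
actual_labels = np.array([1, 0, 0, 0])  # assumed convention: 0 = negative, 1 = positive


def split_errors(texts, predicted, actual):
    """Return (false_positives, false_negatives) as two lists of texts."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Predicted positive, actual negative
    false_pos_idx = np.flatnonzero((predicted == 1) & (actual == 0))
    # Predicted negative, actual positive
    false_neg_idx = np.flatnonzero((predicted == 0) & (actual == 1))
    return [texts[i] for i in false_pos_idx], [texts[i] for i in false_neg_idx]


false_positives, false_negatives = split_errors(texts, predicted_labels, actual_labels)
print(len(false_positives), "false positives,", len(false_negatives), "false negatives")
```

With the real variables, `split_errors(X, predicted_labels, actual_labels)` would give the two groups to read through by hand.
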
## Wrongly classified comments


* Predicted positive, actual negative :\
First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!


To be honest, as a human I thought it was a positive comment when reading it, but it is labeled negative in IMDB. The model probably failed to classify this comment as negative because the author emphasizes that the movie would be enjoyable to Van Damme fans. The model can be tricked by expressions like:\
"Good fun stuff!"\
"This movie is much better than any of the movies [...]"\
"in my opinion it's worth watching."

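One way to check whether such a review is a borderline case is to look at how confident the model was, rather than only at the predicted label. The sketch below turns logits into probabilities with a softmax and computes the gap between the two classes. The logit values are made up for illustration; in the notebook they would come from `predictions.predictions`.

```python
import numpy as np


def softmax(logits):
    """Convert raw logits to probabilities along the last axis."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for two reviews: columns are [negative, positive]
example_logits = np.array([[-0.2, 0.3],
                           [-3.0, 3.5]])

probs = softmax(example_logits)
# A small margin means the model hesitated between the two classes
margin = np.abs(probs[:, 1] - probs[:, 0])
print(probs.round(3), margin.round(3))
```

A misclassified review with a small margin suggests mixed signals in the text, while a large margin suggests the model was confidently misled.
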

* Predicted positive, actual negative :\
The only reason this movie is not given a 1 (awful) vote is that the acting of both Ida Lupino and Robert Ryan is superb. Ida Lupino who is lovely, as usual, becomes increasingly distraught as she tries various means to rid herself of a madman. Robert Ryan is terrifying as the menacing stranger whose character, guided only by his disturbed mind, changes from one minute to the next. Seemingly simple and docile, suddenly he becomes clever and threatening. Ms. Lupino's character was in more danger from that house she lived in and her own stupidity than by anyone who came along. She could not manage to get out of her of her own house: windows didn't open, both front and back doors locked and unlocked from the inside with a key. You could not have designed a worse fire-trap if you tried. She did not take the precaution of having even one extra key. Nor could she figure out how to summon help from nearby neighbors or get out of her own basement while she was locked in and out of sight of her captor. I don't know what war her husband was killed in, but if it was World War II, the furnishings in her house, the styles of the clothes, especially the children and the telephone company repairman's car are clearly anachronistic. I recommend watching this movie just to see what oddities you can find.


The author once again literally says in this comment that they recommend watching the movie. They are clearly laughing at what seems like a B-movie that two good actors save a bit, but explicit compliments like "both Ida Lupino and Robert Ryan is superb", "Ida Lupino who is lovely, as usual" and "I recommend watching this movie" may lead the model to classify it as positive.

* Predicted negative, actual positive :\
Although Bullet In The Brain is, without question, superior amongst short films, it largely seems more like a short piece of writing than a film. And it is a little hard to feel too sorry for the teacher when his smart ass remarks get him shot. But after the bullet enters his brain we begin to understand a little bit about why he became so jaded with life in the first place. There is an awful amount of detail packed into this reasonably short film and this is what makes me feel that it should have been extended a little bit - it seems like there's almost too much to take in at once as the details come flying at you so fast. A slightly more relaxed pace and a less po-faced narrator in the final section would have benefitted this film a little bit. Despite these complaints, there is no denying that Bullet In The Brain is a quite stupendous work compared to many short, and even full length films. The makers should be applauded for trying to make such a basically emotional and literate film in the current climate of quick jokes and Hollywood action.


In this comment the author suggests some ways to make the movie better. They thought that it had good potential, but also that it was a bit wasted. The fact that they use constructive criticism throughout the whole comment may be why the model predicted negative.