 Employ [Hugging Face](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=sentiment) transformers for the same classification task as in the first assignment.

Explore Hugging Face models to find a pre-trained model that is suitable and promising for fine-tuning to your task. It makes sense to pick one that has been pre-trained on the same language and/or text genre.

As a bonus, you can also employ a [domain adaptation](https://huggingface.co/learn/llm-course/chapter7/3?fw=pt) approach, explore [parameter-efficient fine-tuning](https://huggingface.co/docs/peft/main/quicktour) (e.g. LoRA), or [prompting language models](https://huggingface.co/docs/transformers/v4.49.0/en/tasks/prompting).

Compare the performance of your model(s) with the ones developed for the first assignment.

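Most of the models fail on at least one validation text: the review at index 697 tokenizes to 574 tokens, which exceeds the 512-token maximum sequence length reported by the tokenizer.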
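The failures logged in the cells below coincide with a tokenizer warning about a 574-token input against a 512-token limit. One possible workaround, sketched here under that assumption, is to ask the pipeline to truncate long inputs. The helper name `classify_with_truncation` is hypothetical; `truncation` and `max_length` are tokenizer arguments that a text-classification pipeline forwards to its tokenizer.

```python
from typing import Any, Callable, Dict, List


def classify_with_truncation(
    classifier: Callable[..., List[Dict[str, Any]]],
    texts: List[str],
    max_length: int = 512,
) -> List[Dict[str, Any]]:
    """Classify each text, truncating inputs longer than `max_length` tokens.

    `classifier` is expected to behave like a Hugging Face text-classification
    pipeline: calling it on one string returns a list holding one
    {"label": ..., "score": ...} dict.
    """
    results = []
    for text in texts:
        # truncation/max_length are passed through to the tokenizer, so only
        # the first `max_length` tokens of a long review are scored.
        prediction = classifier(text, truncation=True, max_length=max_length)
        results.append(prediction[0])
    return results
```

With the objects defined in this notebook, a call might look like `classify_with_truncation(cardiffnlp_roberta, list(x_val))`. Truncation discards the end of a long review, which may matter when the sentiment shifts late in the text.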

In [14]:
import utils
from utils import CustomDataset
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline

In [15]:
combined_sentiment_df = pd.read_csv("../common/data_sentiment_preprocessed.csv")
combined_sentiment_df_val = pd.read_csv("../common/data_sentiment_preprocessed_val.csv")

In [16]:
x_train = combined_sentiment_df.text
y_train = combined_sentiment_df.sentiment_label
x_val = combined_sentiment_df_val.text
y_val = combined_sentiment_df_val.sentiment_label

## Making use of pretrained Hugging Face models

### twitter-roberta-base-sentiment-latest
This model predicts three labels: positive, negative, and neutral.

In [17]:
# https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
from transformers import pipeline

cardiffnlp_roberta = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")


print(cardiffnlp_roberta("I love you!"))
print(cardiffnlp_roberta("I hate you!"))
print(cardiffnlp_roberta("neutral text"))




Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


[{'label': 'positive', 'score': 0.9748510718345642}]
[{'label': 'negative', 'score': 0.8965417742729187}]
[{'label': 'neutral', 'score': 0.7886281609535217}]


In [18]:
# cardiffnlp_roberta: the gold labels are binary (0/1), so "neutral" is folded into class 1
mapper = {
    "negative": 0,
    "positive": 1,
    "neutral": 1
}
utils.apply_kaggle_model(cardiffnlp_roberta, mapper, x_val, y_val)


Error processing text at index 697: Positives: First time going to this place today. Let me tell you, coming from a family of chefs this was delectable, the dine in meals came out fast, they were LARGE portions, and very good temperature. We ordered the flowered onion( fried and whole), we ordered the Louisiana chook both entrees. Then I had the parmigiana as my main with mash and veg. The mash and veg was perfectly cooked, though the mash tastes a little like packet mash. The sauce with the Louisiana chicken is a little spicy so if you ca n’t tolerate a little spice the sauce is n’t for you. But man oh man the crunch on the chook and the juicy chicken was incredible, was thoroughly enjoyable. The parmigiana was LARGE so much so I could n’t finish it all. Great that they gave takeaways Negatives: The drink I ordered was the summer one in the mocktails section, tasted great only issue I really had was the lemon seeds in the drink, lucky the straws were n’t big enough to suck them up oth
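`utils.apply_kaggle_model` is defined outside this notebook, so its exact behaviour is not shown. As a rough, hypothetical sketch of what such a helper might do: run the classifier on every text, translate the predicted label through `mapper`, skip texts that raise an error, and compute accuracy and F1 on the rest. The function name `evaluate_with_mapper` and its return format are assumptions, and the metrics are computed by hand only to keep the sketch dependency-free.

```python
from typing import Any, Callable, Dict, List, Sequence


def evaluate_with_mapper(
    classifier: Callable[[str], List[Dict[str, Any]]],
    mapper: Dict[str, int],
    texts: Sequence[str],
    labels: Sequence[int],
) -> Dict[str, float]:
    """Score a pipeline-like classifier against binary gold labels (0/1)."""
    y_true, y_pred = [], []
    for index, (text, gold) in enumerate(zip(texts, labels)):
        try:
            predicted_label = classifier(text)[0]["label"]
        except Exception as error:  # e.g. an input longer than the model allows
            print(f"Error processing text at index {index}: {error}")
            continue
        y_true.append(int(gold))
        y_pred.append(mapper[predicted_label])

    # Confusion counts for the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true) if y_true else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "f1": f1, "skipped": len(texts) - len(y_true)}
```

Skipped texts are excluded from the metrics, so a run with errors is scored on slightly fewer examples than a run without them.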

### sentiment-roberta-large-english

In [19]:
#https://huggingface.co/siebert/sentiment-roberta-large-english?library=transformers

"""
    article: https://www.sciencedirect.com/science/article/pii/S0167811622000477
"""

from transformers import pipeline

siebert_roberta = pipeline("text-classification", model="siebert/sentiment-roberta-large-english")


print(siebert_roberta("I love you!"))
print(siebert_roberta("I hate you!"))
print(siebert_roberta("neutral text"))


Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9987329840660095}]
[{'label': 'NEGATIVE', 'score': 0.9992897510528564}]
[{'label': 'POSITIVE', 'score': 0.9969080090522766}]


In [20]:
#siebert_roberta
mapper = {
    "NEGATIVE": 0,
    "POSITIVE": 1
} 
utils.apply_kaggle_model(siebert_roberta, mapper, x_val, y_val)

Token indices sequence length is longer than the specified maximum sequence length for this model (574 > 512). Running this sequence through the model will result in indexing errors


Error processing text at index 697: Positives: First time going to this place today. Let me tell you, coming from a family of chefs this was delectable, the dine in meals came out fast, they were LARGE portions, and very good temperature. We ordered the flowered onion( fried and whole), we ordered the Louisiana chook both entrees. Then I had the parmigiana as my main with mash and veg. The mash and veg was perfectly cooked, though the mash tastes a little like packet mash. The sauce with the Louisiana chicken is a little spicy so if you ca n’t tolerate a little spice the sauce is n’t for you. But man oh man the crunch on the chook and the juicy chicken was incredible, was thoroughly enjoyable. The parmigiana was LARGE so much so I could n’t finish it all. Great that they gave takeaways Negatives: The drink I ordered was the summer one in the mocktails section, tasted great only issue I really had was the lemon seeds in the drink, lucky the straws were n’t big enough to suck them up oth

### AG6019/reddit-comment-sentiment

In [21]:
# https://huggingface.co/AG6019/reddit-comment-sentiment?library=transformers
from transformers import pipeline

AG6019_comment = pipeline("text-classification", model="AG6019/reddit-comment-sentiment")

print(AG6019_comment("I love you!"))
print(AG6019_comment("I dont like you!"))
print(AG6019_comment("neutral text"))

Device set to use mps:0


[{'label': 'LABEL_1', 'score': 0.9998805522918701}]
[{'label': 'LABEL_0', 'score': 0.8577982187271118}]
[{'label': 'LABEL_0', 'score': 0.9975653886795044}]


In [22]:
mapper = {
    "LABEL_0": 0,
    "LABEL_1": 1,
}
utils.apply_kaggle_model(AG6019_comment, mapper, x_val, y_val)

Token indices sequence length is longer than the specified maximum sequence length for this model (574 > 512). Running this sequence through the model will result in indexing errors


Error processing text at index 697: Positives: First time going to this place today. Let me tell you, coming from a family of chefs this was delectable, the dine in meals came out fast, they were LARGE portions, and very good temperature. We ordered the flowered onion( fried and whole), we ordered the Louisiana chook both entrees. Then I had the parmigiana as my main with mash and veg. The mash and veg was perfectly cooked, though the mash tastes a little like packet mash. The sauce with the Louisiana chicken is a little spicy so if you ca n’t tolerate a little spice the sauce is n’t for you. But man oh man the crunch on the chook and the juicy chicken was incredible, was thoroughly enjoyable. The parmigiana was LARGE so much so I could n’t finish it all. Great that they gave takeaways Negatives: The drink I ordered was the summer one in the mocktails section, tasted great only issue I really had was the lemon seeds in the drink, lucky the straws were n’t big enough to suck them up oth

### DT12the/distilbert-sentiment-analysis

In [23]:
# Use a pipeline as a high-level helper
from transformers import pipeline

DT12the = pipeline("text-classification", model="DT12the/distilbert-sentiment-analysis")
print(DT12the("I don't like you!"))
print(DT12the("this is really good!"))
print(DT12the("neutral text"))

Device set to use mps:0


[{'label': 'LABEL_1', 'score': 0.9046109914779663}]
[{'label': 'LABEL_0', 'score': 0.9963231086730957}]
[{'label': 'LABEL_0', 'score': 0.915234386920929}]
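Checkpoints that expose only `LABEL_0` and `LABEL_1` do not say which one is positive; the probe outputs above suggest this one uses `LABEL_1` for negative text. A small sketch of automating that check follows. The function name `infer_label_mapping` and the extra probe sentences are illustrative assumptions, and a handful of probes is only a heuristic.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple


def infer_label_mapping(
    classifier: Callable[[str], List[Dict[str, Any]]],
    probes: Sequence[Tuple[str, int]],
) -> Dict[str, int]:
    """Map each raw model label to the class (0/1) it most often co-occurs with."""
    votes: Dict[str, Counter] = {}
    for text, expected_class in probes:
        raw_label = classifier(text)[0]["label"]
        votes.setdefault(raw_label, Counter())[expected_class] += 1
    # For every raw label, keep the class that received the most votes.
    return {label: counts.most_common(1)[0][0] for label, counts in votes.items()}


# Clearly polarised probe sentences with their intended class (1 = positive).
PROBES = [
    ("this is really good!", 1),
    ("I absolutely loved it.", 1),
    ("I don't like you!", 0),
    ("This was terrible.", 0),
]
```

The result of `infer_label_mapping(DT12the, PROBES)` can then be compared against the hand-written `mapper` in the next cell before trusting either.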


In [24]:
mapper = {
    "LABEL_0": 1,
    "LABEL_1": 0,
}
utils.apply_kaggle_model(DT12the, mapper, x_val, y_val)

Token indices sequence length is longer than the specified maximum sequence length for this model (574 > 512). Running this sequence through the model will result in indexing errors


Error processing text at index 697: Positives: First time going to this place today. Let me tell you, coming from a family of chefs this was delectable, the dine in meals came out fast, they were LARGE portions, and very good temperature. We ordered the flowered onion( fried and whole), we ordered the Louisiana chook both entrees. Then I had the parmigiana as my main with mash and veg. The mash and veg was perfectly cooked, though the mash tastes a little like packet mash. The sauce with the Louisiana chicken is a little spicy so if you ca n’t tolerate a little spice the sauce is n’t for you. But man oh man the crunch on the chook and the juicy chicken was incredible, was thoroughly enjoyable. The parmigiana was LARGE so much so I could n’t finish it all. Great that they gave takeaways Negatives: The drink I ordered was the summer one in the mocktails section, tasted great only issue I really had was the lemon seeds in the drink, lucky the straws were n’t big enough to suck them up oth

## Bonus

### Training models

#### AG6019/reddit-comment-sentiment

In [10]:

# https://huggingface.co/AG6019/reddit-comment-sentiment


AG6019_tokenizer = AutoTokenizer.from_pretrained("AG6019/reddit-comment-sentiment")
AG6019_model = AutoModelForSequenceClassification.from_pretrained("AG6019/reddit-comment-sentiment", num_labels=2)


print(AG6019_model.config.id2label)

num_parameters = AG6019_model.num_parameters() / 1_000_000
print(f"Number of parameters: {num_parameters:.2f}M")


train_encodings = utils.tokenize_data(x_train, AG6019_tokenizer)
val_encodings = utils.tokenize_data(x_val, AG6019_tokenizer)

{0: 'LABEL_0', 1: 'LABEL_1'}
Number of parameters: 66.96M
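`utils.tokenize_data` and `CustomDataset` live in a separate module that is not part of this notebook. Purely as an assumption about their shape, a dataset wrapper for `Trainer` typically pairs each row of the tokenizer output with its label. The class below, `EncodingsDataset`, is a hypothetical stand-in and not the notebook's actual implementation.

```python
from typing import Any, Dict, List, Sequence


class EncodingsDataset:
    """Map-style dataset pairing tokenizer encodings with labels.

    `encodings` is a dict of equally long lists, e.g. the `input_ids` and
    `attention_mask` returned by a tokenizer called with padding and truncation.
    """

    def __init__(self, encodings: Dict[str, List[Any]], labels: Sequence[int]):
        self.encodings = encodings
        # Copy labels into a list so positional indexing works even when the
        # labels come from a pandas Series with a non-default index.
        self.labels = [int(label) for label in labels]

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # One example: every encoding field at position idx, plus the label
        # under "labels", the argument name that sequence-classification
        # models use when computing the loss.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item
```

Because all rows are padded to the same length, each field can be batched without a custom collator.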


In [None]:
train_dataset = CustomDataset(train_encodings, y_train)
val_dataset = CustomDataset(val_encodings, y_val)

training_args = TrainingArguments(
    output_dir="./AG6019_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    logging_dir="./AG6019_logs",
    learning_rate=2e-5,
    weight_decay=0.01,
)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": accuracy_score(p.label_ids, preds),
        "f1": f1_score(p.label_ids, preds),
    }

trainer = Trainer(
    model=AG6019_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# Save the model
trainer.save_model("AG6019_model")
# Save the tokenizer
AG6019_tokenizer.save_pretrained("AG6019_model")



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3293,0.286509,0.894389,0.890598
2,0.245,0.356249,0.890264,0.891251
3,0.1538,0.472964,0.891914,0.891646


AttributeError: 'TrainingArguments' object has no attribute 'save_to_json'
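In the training log above, validation loss rises from 0.2865 to 0.4730 while training loss keeps falling and accuracy stays near 0.89, which suggests the model starts to overfit after the first epoch. A minimal sketch of picking the best epoch from such a log is shown below; the `select_best_epoch` helper is hypothetical, and the numbers are copied from the table.

```python
from typing import Dict, List

# Per-epoch metrics copied from the training log above.
LOG: List[Dict[str, float]] = [
    {"epoch": 1, "train_loss": 0.3293, "val_loss": 0.286509, "accuracy": 0.894389, "f1": 0.890598},
    {"epoch": 2, "train_loss": 0.2450, "val_loss": 0.356249, "accuracy": 0.890264, "f1": 0.891251},
    {"epoch": 3, "train_loss": 0.1538, "val_loss": 0.472964, "accuracy": 0.891914, "f1": 0.891646},
]


def select_best_epoch(log, metric="val_loss", greater_is_better=False):
    """Return the log entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])


best = select_best_epoch(LOG)
print(best["epoch"])  # → 1, the epoch with the lowest validation loss
```

`Trainer` can make this selection itself when `load_best_model_at_end=True` is set together with a save strategy that matches the evaluation strategy; that option is not used in this notebook.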

#### siebert/sentiment-roberta-large-english

In [4]:

# https://huggingface.co/siebert/sentiment-roberta-large-english


siebert_tokenizer = AutoTokenizer.from_pretrained("siebert/sentiment-roberta-large-english")
siebert_model = AutoModelForSequenceClassification.from_pretrained("siebert/sentiment-roberta-large-english")
print(siebert_model.config.id2label)


num_parameters = siebert_model.num_parameters() / 1_000_000
print(f"Number of parameters: {num_parameters:.2f}M")


train_encodings = utils.tokenize_data(x_train, siebert_tokenizer)
val_encodings = utils.tokenize_data(x_val, siebert_tokenizer)

{0: 'NEGATIVE', 1: 'POSITIVE'}
Number of parameters: 355.36M


In [None]:
train_dataset = CustomDataset(train_encodings, y_train)
val_dataset = CustomDataset(val_encodings, y_val)


training_args = TrainingArguments(
    output_dir="./siebert_results",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    logging_dir="./siebert_logs",
    learning_rate=2e-5,
    weight_decay=0.01,
)


def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": accuracy_score(p.label_ids, preds),
        "f1": f1_score(p.label_ids, preds),
    }

trainer = Trainer(
    model=siebert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# Save the model
trainer.save_model("siebert_model")
# Save the tokenizer
siebert_tokenizer.save_pretrained("siebert_model")





Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.0,,0.504125,0.0
2,0.0,,0.504125,0.0


AttributeError: 'TrainingArguments' object has no attribute 'save_to_json'
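The log above shows a training loss of 0.0, a missing validation loss, an F1 of 0.0, and the same accuracy in both epochs. One explanation consistent with these numbers, though not confirmed here, is that the logits became NaN: `np.argmax` returns index 0 for an all-NaN row, so every example would be predicted as class 0. The sketch below illustrates this on toy data; `has_non_finite_logits` is a hypothetical helper.

```python
import numpy as np


def has_non_finite_logits(logits: np.ndarray) -> bool:
    """Return True if any logit is NaN or infinite."""
    return not np.isfinite(logits).all()


# Toy batch: four examples, two classes, all logits NaN.
nan_logits = np.full((4, 2), np.nan)
labels = np.array([0, 1, 0, 1])

# argmax treats NaN as the maximum and returns its first position.
preds = np.argmax(nan_logits, axis=1)
accuracy = float((preds == labels).mean())

print(has_non_finite_logits(nan_logits))  # → True
print(preds.tolist())                     # → [0, 0, 0, 0]
print(accuracy)                           # → 0.5, the share of class-0 labels
```

With no positive predictions, F1 for the positive class is 0 and accuracy equals the share of negative examples, which would match the reported 0.504125 if roughly half of the validation labels are 0. Calling a check like this inside `compute_metrics` is one way to surface the problem during the first evaluation.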

#### DT12the/distilbert-sentiment-analysis

In [8]:

# https://huggingface.co/DT12the/distilbert-sentiment-analysis

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DT12the_tokenizer = AutoTokenizer.from_pretrained("DT12the/distilbert-sentiment-analysis")
DT12the_model = AutoModelForSequenceClassification.from_pretrained("DT12the/distilbert-sentiment-analysis", num_labels=2)


print(DT12the_model.config.id2label)




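# The probe outputs earlier suggest this checkpoint uses LABEL_0 for positive
# and LABEL_1 for negative text, so flip the gold labels to match it.
mapping = {0: 1, 1: 0}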
y_train_inverted = y_train.map(mapping)
y_val_inverted = y_val.map(mapping)



num_parameters = DT12the_model.num_parameters() / 1_000_000


train_encodings = utils.tokenize_data(x_train, DT12the_tokenizer)
val_encodings = utils.tokenize_data(x_val, DT12the_tokenizer)

{0: 'LABEL_0', 1: 'LABEL_1'}


In [None]:
train_dataset = CustomDataset(train_encodings, y_train_inverted)
val_dataset = CustomDataset(val_encodings, y_val_inverted)

training_args = TrainingArguments(
    output_dir="./DT12the_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    logging_dir="./DT12the_logs",
    learning_rate=2e-5,
    weight_decay=0.01,
)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "accuracy": accuracy_score(p.label_ids, preds),
        "f1": f1_score(p.label_ids, preds),
    }

trainer = Trainer(
    model=DT12the_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# Save the model
trainer.save_model("DT12the_model")
# Save the tokenizer
DT12the_tokenizer.save_pretrained("DT12the_model")