# Requirements

In [1]:
# Add as many imports as you need.
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support, accuracy_score



# Laboratory Exercise - Run Mode (8 points)

## Introduction
The primary objective of this laboratory assignment is to fine-tune a pre-trained language model for binary classification on a dataset of Spotify user reviews. The dataset contains two attributes:

+ **review** - A text column containing user feedback, opinions, and experiences with the Spotify application.
+ **sentiment** - A categorical column indicating whether the review has a positive or negative sentiment. In the CSV file loaded below, this column is named `label`.

Your task involves training a model to predict the **sentiment** (either "positive" or "negative") based on the content of the **review**.

## The Spotify User Reviews Dataset

Load the dataset using the `datasets` library.

In [2]:
# Write your code here. Add as many boxes as you need.
ds = load_dataset('csv', data_files={'train':'../datasets/spotify-user-reviews.csv'})
ds

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 10000
    })
})

In [3]:
def change_labels(data):
    # Map the string sentiment to an integer class id:
    # 'positive' -> 1, anything else (i.e. 'negative') -> 0.
    data['label'] = 1 if data['label'] == 'positive' else 0
    return data

In [4]:
ds = ds.map(change_labels)
ds['train']['label'][:5]


[1, 0, 1, 0, 1]

## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.


In [5]:
# Write your code here. Add as many boxes as you need.
split_ds = ds['train'].train_test_split(test_size=0.2,seed=42)
split_ds

DatasetDict({
    train: Dataset({
        features: ['review', 'label'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['review', 'label'],
        num_rows: 2000
    })
})
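The split above shuffles rows at random without looking at the labels. As a hedged alternative (using scikit-learn's `train_test_split` rather than the `datasets` method from this notebook), a stratified split keeps the class ratio identical in both parts. The `toy_reviews` and `toy_labels` values below are made up purely for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 placeholder reviews, 5 positive (1) and 5 negative (0).
toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = [1, 0] * 5

# stratify=toy_labels keeps the 50/50 class ratio in both the train and test parts.
train_x, test_x, train_y, test_y = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

print(len(train_x), len(test_x))   # sizes of the two parts
print(sum(train_y), sum(test_y))   # number of positive labels in each part
```

On a large and roughly balanced dataset a plain random split usually ends up close to balanced anyway, so stratification matters most for small or skewed data.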

## Tokenization
Tokenize the texts using the `AutoTokenizer` class.

In [6]:
# Write your code here. Add as many boxes as you need.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [7]:
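def tokenize_func(example):
    # Truncate every review to at most 32 tokens and pad shorter ones
    # to the longest sequence in the current batch.
    tokenized = tokenizer(example['review'], padding=True, truncation=True, max_length=32)
    # The model reads its targets from a column named 'labels'.
    tokenized['labels'] = example['label']

    return tokenized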
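To make the effect of `padding=True`, `truncation=True` and `max_length` concrete, here is a simplified sketch on made-up token ids. It is not the tokenizer's implementation: the helper `pad_and_truncate` and the default `pad_id=0` are hypothetical, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy illustration of truncation followed by padding to the longest sequence."""
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend shorter sequences to the longest one in this batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two reviews of different lengths.
toy_batch = [[5, 6, 7, 8, 9], [5, 6]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because padding is applied per batch, sequences from different `map` batches can end up with different lengths, up to `max_length`.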

In [8]:
tokenized_ds = split_ds.map(tokenize_func,batched=True)
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

## Fine-tuning a Pre-trained Language Model for Classification
Fine-tune a pre-trained language model for classification on the given dataset.

Define the model using the `AutoModelForSequenceClassification` class.

In [9]:
# Write your code here. Add as many boxes as you need.
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define the training parameters using the `TrainingArguments` class.

In [10]:
# Write your code here. Add as many boxes as you need.
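args = TrainingArguments(
    output_dir='./trainer',
    per_device_eval_batch_size=16,
    per_device_train_batch_size=16,
    # Gradients are accumulated over 4 batches before each optimizer step.
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    # Evaluate on the test split at the end of every epoch.
    eval_strategy='epoch',
    learning_rate=2e-5,
    fp16=True,
    metric_for_best_model='accuracy',
)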

Define the training using the `Trainer` class.

In [11]:
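def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw model outputs (logits) and the true labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = logits.argmax(axis=1)

    accuracy = accuracy_score(labels, predictions)
    # 'weighted' averages the per-class scores, weighted by class support.
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="weighted")

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }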
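As a quick sanity check, the same metric logic can be run on a handful of hand-written values. The `toy_logits` and `toy_labels` arrays below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=1)
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical logits for 4 examples; the row-wise argmax gives predictions [0, 1, 0, 1].
toy_logits = np.array([[2.0, 0.1], [0.2, 1.3], [1.1, 0.9], [0.3, 0.8]])
toy_labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # accuracy 0.75, weighted precision 0.875, weighted recall 0.75
```

Support-weighted recall is always equal to accuracy, so those two values coincide by construction.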

In [12]:
# Write your code here. Add as many boxes as you need.
trainer = Trainer(
    model=model,
    args=args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    processing_class=tokenizer
)

Fine-tune (train) the pre-trained language model.

In [13]:
# Write your code here. Add as many boxes as you need.
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | No log | 0.284275 | 0.889 | 0.889001 | 0.889 | 0.889 |


TrainOutput(global_step=125, training_loss=0.36181317138671876, metrics={'train_runtime': 140.3855, 'train_samples_per_second': 56.986, 'train_steps_per_second': 0.89, 'total_flos': 66233699328000.0, 'train_loss': 0.36181317138671876, 'epoch': 1.0})
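The reported `global_step=125` can be reproduced with a little arithmetic from the training arguments above. The sketch assumes a single device (`num_devices = 1`) and the default `logging_steps` of 500; neither is stated explicitly in this notebook.

```python
import math

num_train_rows = 8000              # size of the training split
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_train_epochs = 1
num_devices = 1                    # assumption: training ran on one device

# One optimizer step consumes this many examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

steps_per_epoch = math.ceil(num_train_rows / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

default_logging_steps = 500        # assumption: logging_steps left at its default

print(effective_batch_size)        # 64
print(total_steps)                 # 125
print(total_steps < default_logging_steps)  # True
```

Since training finishes before the first logging step is reached, this would explain why the Training Loss column shows "No log". Passing a smaller `logging_steps` to `TrainingArguments` should make the training loss appear.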

Use the trained model to make predictions for the test set.

In [14]:
predictions = trainer.predict(tokenized_ds['test'])

In [16]:
# Write your code here. Add as many boxes as you need.
y_pred = np.argmax(predictions.predictions, axis=1)  # Convert logits to class predictions
y_true = np.array(tokenized_ds['test']['label'])
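The model outputs raw logits, and taking the argmax is enough to obtain class predictions. If class probabilities are also wanted, a softmax can be applied first. The sketch below uses made-up logits (`toy_logits`) and a hand-written `softmax` helper, both of which are illustrative only.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits for 3 examples and 2 classes.
toy_logits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, -0.2]])

probs = softmax(toy_logits)            # each row sums to 1
y_pred_toy = probs.argmax(axis=1)      # same result as toy_logits.argmax(axis=1)
print(y_pred_toy)                      # [0 1 1]
```

Softmax preserves the ordering of the values in each row, so the predicted classes are the same with or without it.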


Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [17]:
# Write your code here. Add as many boxes as you need.
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89      1001
           1       0.89      0.89      0.89       999

    accuracy                           0.89      2000
   macro avg       0.89      0.89      0.89      2000
weighted avg       0.89      0.89      0.89      2000



# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a machine learning pipeline to classify Spotify user reviews as positive or negative. Use TF-IDF vectorization to transform the review text into numerical features, and train a logistic regression model on the transformed data. Split the dataset into training and testing sets, fit the pipeline on the training data, and evaluate its performance using metrics such as precision, recall, and F1-score. To gain insights into the most influential words or phrases associated with positive and negative reviews, analyze the coefficients from the logistic regression model trained on the TF-IDF features. Present the top keywords for each sentiment in a table or a bar chart to provide a clear understanding of the terms driving user feedback.

In [None]:
# Write your code here. Add as many boxes as you need.