Note: The stored console outputs (print statements) come from the initial model runs, when the notebook's
messages were written in Spanish, so they may not match the current print statements word for word.
Because these steps involve computationally expensive models, the notebook was not re-executed after
the code was translated into English for reproducibility. The stored outputs do not affect how the notebook runs.

In [None]:
# GPU information (if available)
!nvidia-smi

import torch

if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    total_vram = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Detected GPU: {device_name}")
    print(f"Total VRAM: {total_vram:.2f} GB")
else:
    print("No GPU detected. Running on CPU.")

Tue Dec  9 10:16:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   34C    P0             52W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
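
The memory figures in this cell use different units: nvidia-smi reports MiB, while the code divides the byte count by 1e9 to print decimal GB. A minimal sketch of the conversion, assuming the 81920 MiB shown in the table above (the helper name is illustrative, and the total that torch reports may be slightly lower than the nvidia-smi figure):

```python
MIB = 1024 ** 2  # bytes per mebibyte

def mib_to_units(mib):
    """Convert a MiB count into (decimal GB, binary GiB)."""
    n_bytes = mib * MIB
    return n_bytes / 1e9, n_bytes / 1024 ** 3

# 81920 MiB is the total memory listed in the nvidia-smi table
gb, gib = mib_to_units(81920)
print(f"{gb:.2f} GB (decimal) = {gib:.2f} GiB (binary)")
```

This is why an "80GB" card can show up as roughly 86 GB when the byte count is divided by 1e9.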
                                                

In [2]:
import pandas as pd

# Load dataset generated in Notebook 4
df = pd.read_pickle(INPUT_FILE)
print(f"Dataset loaded: {df.shape} rows")
print(f"Columns: {list(df.columns)}")

Dataset loaded: (74759, 21) rows
Columns: ['review_text', 'review_en', 'rating', 'date', 'user_total_reviews', 'user_id', 'is_local_guide', 'lang', 'park_name', 'text_length', 'year', 'month', 'quarter', 'month_name', 'season', 'gender', 'text_bert', 'text_stats', 'text_en_clean', 'topic_id', 'topic_label']


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74759 entries, 0 to 74758
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   review_text         74759 non-null  object             
 1   review_en           74759 non-null  object             
 2   rating              74759 non-null  int64              
 3   date                74759 non-null  datetime64[ns, UTC]
 4   user_total_reviews  74759 non-null  int64              
 5   user_id             74759 non-null  object             
 6   is_local_guide      74759 non-null  bool               
 7   lang                74759 non-null  object             
 8   park_name           74759 non-null  object             
 9   text_length         74759 non-null  int64              
 10  year                74759 non-null  int32              
 11  month               74759 non-null  int32              
 12  quarter             74759 non-nu

In [None]:
from transformers import pipeline

# Dictionary to store all sentiment-analysis models
models = {}

# -------------------------------------------------------------------------
# Model 1: Multilingual BERT (predicts 1–5 star ratings)
models["bert_5stars"] = pipeline(
    task="sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    tokenizer="nlptown/bert-base-multilingual-uncased-sentiment"
)

# -------------------------------------------------------------------------
# Model 2: Multilingual DistilBERT (negative / neutral / positive)
models["distilbert_multilingual"] = pipeline(
    task="sentiment-analysis",
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    tokenizer="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    return_all_scores=False
)

# -------------------------------------------------------------------------
# Model 3: XLM-RoBERTa multilingual (negative / neutral / positive)
models["xlmr_twitter"] = pipeline(
    task="sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
    tokenizer="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
    return_all_scores=False
)

print("Models successfully loaded:")
for name in models:
    print(" -", name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


Device set to use cuda:0


Device set to use cuda:0


Models successfully loaded:
 - nlptown_5stars
 - distilbert_multi
 - xlmr_twitter_multi


In [None]:
import re

# Convert a 1–5 star rating into a three-class polarity label
def map_stars_to_sentiment(stars):
    if stars <= 2:
        return "negative"
    elif stars == 3:
        return "neutral"
    else:
        return "positive"

# Basic preprocessing to ensure the input text is a valid string
def _preprocess_text(t):
    if not isinstance(t, str):
        return ""
    return t.strip()

# Unified sentiment prediction function for all loaded models
def predict_sentiment(text, model_key):
    text = _preprocess_text(text)

    # Assign neutral sentiment for empty inputs
    if text == "":
        return "neutral"

    clf = models[model_key]

    # Inference with safe truncation for long inputs
    result = clf(
        text,
        truncation=True,
        max_length=512
    )

    res = result[0] if isinstance(result, list) else result
    label = res.get("label", "")
    label_lower = label.lower()

    # Special case: the nlptown model returns labels like "1 star", "5 stars", etc.
    if model_key == "bert_5stars":
        match = re.search(r"(\d)", label_lower)
        if match:
            stars_pred = int(match.group(1))
            return map_stars_to_sentiment(stars_pred)
        return "neutral"

    # For all other models, assume standard labels:
    # "negative", "neutral" or "positive"
    return label_lower
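
The label handling in `predict_sentiment` can be exercised without loading any model. The sketch below reproduces only the post-processing step; `normalize_label` and `stars_to_polarity` are hypothetical names introduced for illustration, and the raw labels are hand-written examples rather than real model outputs:

```python
import re

def stars_to_polarity(stars):
    """Same thresholds as map_stars_to_sentiment: 1-2 negative, 3 neutral, 4-5 positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

def normalize_label(label, star_model):
    """Map a raw pipeline label to negative / neutral / positive."""
    label = label.lower()
    if star_model:
        # Star-rating labels look like "1 star" or "5 stars"
        match = re.search(r"(\d)", label)
        return stars_to_polarity(int(match.group(1))) if match else "neutral"
    return label

examples = [("1 star", True), ("3 stars", True), ("5 stars", True), ("Positive", False)]
for raw, star_model in examples:
    print(f"{raw!r} -> {normalize_label(raw, star_model)}")
```

Falling back to "neutral" when no digit is found mirrors the original function, although it can hide an unexpected label format.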

In [None]:
from sklearn.metrics import classification_report

# Ground truth label derived from the numeric rating
df["sentiment_true"] = df["rating"].apply(map_stars_to_sentiment)

# Subsample to reduce computational cost
df_sample = df.sample(1000, random_state=42).copy()

# Models to evaluate: (model_key, descriptive_label)
models_to_evaluate = [
    ("distilbert_multilingual", "DistilBERT Multilingual"),
    ("xlmr_twitter", "XLM-RoBERTa Twitter Multilingual"),
    ("bert_5stars", "Multilingual BERT (1–5 stars)")
]

for model_key, model_label in models_to_evaluate:
    print("\n===========================================================")
    print(f"  Model: {model_label}")
    print("===========================================================")

    pred_col = f"pred_{model_key}"

    # Sentiment prediction for the current model
    df_sample[pred_col] = df_sample["text_bert"].apply(
        lambda t: predict_sentiment(t, model_key)
    )

    # Filter valid predictions
    mask = df_sample[pred_col].notna()
    y_true = df_sample.loc[mask, "sentiment_true"]
    y_pred = df_sample.loc[mask, pred_col]

    # Classification report
    print(classification_report(y_true, y_pred, digits=4))


  Model: DistilBERT Multilingual


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


              precision    recall  f1-score   support

    negative     0.5000    0.7958    0.6141       240
     neutral     0.1176    0.0146    0.0260       137
    positive     0.8270    0.7978    0.8121       623

    accuracy                         0.6900      1000
   macro avg     0.4815    0.5361    0.4841      1000
weighted avg     0.6513    0.6900    0.6569      1000


  Model: XLM-RoBERTa Twitter Multilingual
              precision    recall  f1-score   support

    negative     0.5167    0.9042    0.6576       240
     neutral     0.2087    0.1752    0.1905       137
    positive     0.9333    0.6966    0.7978       623

    accuracy                         0.6750      1000
   macro avg     0.5529    0.5920    0.5486      1000
weighted avg     0.7341    0.6750    0.6809      1000


  Model: Multilingual BERT 1–5 stars (nlptown)
              precision    recall  f1-score   support

    negative     0.6743    0.8542    0.7537       240
     neutral     0.3642    0.4015 
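
The `macro avg` and `weighted avg` rows in these reports can be recomputed from the per-class rows. A minimal sketch using the F1 and support values copied from the DistilBERT report above; the majority-class baseline is an added reference point, not something the notebook computes:

```python
# Per-class F1 and support copied from the DistilBERT report
f1 = {"negative": 0.6141, "neutral": 0.0260, "positive": 0.8121}
support = {"negative": 240, "neutral": 137, "positive": 623}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Accuracy of always predicting the most frequent class
majority_baseline = max(support.values()) / total

print(f"macro F1:          {macro_f1:.4f}")
print(f"weighted F1:       {weighted_f1:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Because the positive class dominates the sample, the weighted figures sit well above the macro figures, and the macro average is the one that exposes the weak neutral class.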

In [None]:
# Directory for intermediate checkpoints
CHECKPOINT_DIR = MODELS_DIR / "bert_nlptown_checkpoints"

# Directory for the final fine-tuned model (used later for inference)
FINAL_MODEL_DIR = MODELS_DIR / "bert_nlptown_finetuned_v1"

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from sklearn.metrics import accuracy_score, f1_score
import time

# ============================================================
# 1. Dataset preparation
# ============================================================
print("Preparing dataset splits (train/val/test)...")

df_tuning = df[['text_bert', 'rating']].copy()
df_tuning['label'] = df_tuning['rating'] - 1  # Convert to 0–4 scale

# Step 1: Hold-out test set (15%)
train_val_df, test_df = train_test_split(
    df_tuning,
    test_size=0.15,
    random_state=42,
    stratify=df_tuning['label']
)

# Step 2: Train/Validation split inside remaining 85%
# Validation should represent ~15% of the total dataset → 0.15 / 0.85 ≈ 0.176
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.176,
    random_state=42,
    stratify=train_val_df['label']
)

print(f"Final sizes → Train: {len(train_df)} | Val: {len(val_df)} | Test: {len(test_df)}")

dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df)
})

# ============================================================
# 2. Tokenization
# ============================================================
print("Tokenizing dataset...")

model_checkpoint = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(batch):
    return tokenizer(batch["text_bert"], truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# ============================================================
# 3. Model and metrics
# ============================================================
print("Loading base model...")

model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=5
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return {"accuracy": acc, "f1": f1}

# ============================================================
# 4. Fine-tuning
# ============================================================
print("Starting fine-tuning on GPU...")

training_args = TrainingArguments(
    output_dir=str(CHECKPOINT_DIR),
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=True,
    report_to="none",

    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

print("Training started...")

start = time.time()
trainer.train()
elapsed = time.time() - start

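print(f"Training completed in {int(elapsed // 60)} min {int(elapsed % 60)} sec")

# Save the best model (restored by load_best_model_at_end) so the inference
# cell can load it; the stored output below reports this save step
trainer.save_model(str(FINAL_MODEL_DIR))
print(f"Fine-tuning completed. Model saved to: {FINAL_MODEL_DIR}")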



1. Preparing dataset (Train / Val / Test)...
Final sizes -> Train: 52361, Val: 11184, Test: 11214
2. Tokenizing...

3. Loading base model...
4. Starting training on A100 GPU...

Starting timer...
Training model (patience: 3 epochs)...


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.883,0.834479,0.645833,0.632448
2,0.8089,0.875229,0.639306,0.631296
3,0.7114,0.929928,0.640558,0.631321
4,0.6115,0.997314,0.626431,0.623112


Training completed in 12 minutes and 30 seconds
5. Saving final model...
Fine-tuning completed. Model saved to: /content/drive/MyDrive/UOC/Master en Ciencia de Datos/2025.1/TFM/TFM_Analisis_Reviews_Parques_Tematicos/04_Models/bert_nlptown_finetuned_v1
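
Training was configured for 10 epochs but the log stops after epoch 4, which is consistent with a patience of 3 applied to validation F1. The sketch below is a simplified model of that patience logic, not the callback's actual code; `early_stop_epoch` is a hypothetical helper and the scores are the F1 values from the log above:

```python
def early_stop_epoch(scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a higher-is-better metric."""
    best, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # New best score: remember it and reset the patience counter
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch

val_f1 = [0.632448, 0.631296, 0.631321, 0.623112]  # validation F1 per epoch
stop_epoch, best_epoch = early_stop_epoch(val_f1, patience=3)
print(f"Stopped after epoch {stop_epoch}; best epoch was {best_epoch}")
```

Under this model the best validation F1 is reached at epoch 1, so with `load_best_model_at_end=True` the model kept at the end would be the epoch 1 checkpoint.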


In [None]:
import time

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split

# ============================================================
# 1. Model and path configuration
# ============================================================
local_model_path = str(FINAL_MODEL_DIR)
base_tokenizer_name = "nlptown/bert-base-multilingual-uncased-sentiment"

print(f"Model path: {local_model_path}")
print(f"Tokenizer source: {base_tokenizer_name}")

# ============================================================
# 2. Load fine-tuned model and tokenizer
# ============================================================
try:
    loaded_model = AutoModelForSequenceClassification.from_pretrained(local_model_path)
    loaded_tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_name)
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")
    raise e

sentiment_pipeline = pipeline(
    task="sentiment-analysis",
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    device=0
)

# ============================================================
# 3. Reconstruct dataset split (train/val/test)
# ============================================================
print("Reconstructing dataset split (random_state=42)...")

stratify_col = df["rating"] if "rating" in df.columns else None

# Phase 1: train+val vs. test
ids_train_val, ids_test = train_test_split(
    df.index,
    test_size=0.15,
    random_state=42,
    stratify=stratify_col
)

# Phase 2: train vs. validation
stratify_sub = df.loc[ids_train_val, "rating"] if stratify_col is not None else None

ids_train, ids_val = train_test_split(
    ids_train_val,
    test_size=0.176,
    random_state=42,
    stratify=stratify_sub
)

df["dataset_split"] = "unassigned"
df.loc[ids_train, "dataset_split"] = "train"
df.loc[ids_val, "dataset_split"] = "validation"
df.loc[ids_test, "dataset_split"] = "test"

print("Split distribution:")
print(df["dataset_split"].value_counts())

# ============================================================
# 4. Inference on full dataset
# ============================================================
print(f"\nRunning inference on {len(df)} samples...")

start_time = time.time()

input_texts = df["text_bert"].tolist()
raw_predictions = []

# Feed the pipeline in chunks so tqdm can report progress; passing batch_size
# makes each chunk run through the model as a single GPU batch
batch_size = 128
for i in tqdm(range(0, len(input_texts), batch_size)):
    batch = input_texts[i:i + batch_size]
    outputs = sentiment_pipeline(
        batch,
        batch_size=batch_size,
        truncation=True,
        max_length=512
    )
    raw_predictions.extend(outputs)

elapsed = time.time() - start_time
print(f"Inference completed in {int(elapsed // 60)}m {int(elapsed % 60)}s.")

# ============================================================
# 5. Post-processing
# ============================================================
print("Mapping predictions...")

def extract_rating(pred):
    """Extract numeric score from labels such as '3 stars'."""
    label = pred["label"].lower().strip()
    num = label.split(" ")[0]
    return int(num)

def map_sentiment(r):
    """Map 1–5 star score into sentiment classes."""
    if r <= 2:
        return "negative"
    if r == 3:
        return "neutral"
    return "positive"

df["predicted_rating"] = [extract_rating(p) for p in raw_predictions]
df["final_sentiment"] = df["predicted_rating"].apply(map_sentiment)

# ============================================================
# 6. Export
# ============================================================
pkl_path = PROCESSED_DIR / "05_dataset_sentiment_topics.pkl"
df.to_pickle(pkl_path)

print(f"Dataset exported to: {pkl_path}")

Model path (local): /content/drive/MyDrive/UOC/Master en Ciencia de Datos/2025.1/TFM/TFM_Analisis_Reviews_Parques_Tematicos/04_Models/bert_nlptown_finetuned_v1
Base tokenizer (Hugging Face): nlptown/bert-base-multilingual-uncased-sentiment


Device set to use cuda:0


Model and tokenizer loaded successfully
Reconstructing the split using random_state=42...
Labeling rows in the main DataFrame...
Recovered distribution
dataset_split
train         52361
test          11214
validation    11184
Name: count, dtype: int64
Reconstruction successful

Starting timer...
Processing 74759 records on NVIDIA A100...

Inference completed in 4m 17s.
Mapping results...

Dataset saved to: /content/drive/MyDrive/UOC/Master en Ciencia de Datos/2025.1/TFM/TFM_Analisis_Reviews_Parques_Tematicos/02_Data/05_dataset_sentiment_topics.csv
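
The split sizes reported above follow from the two-stage 15% / 17.6% split. A minimal sketch of the arithmetic, assuming the held-out portion is rounded up at each stage (an assumption that reproduces the reported counts); `two_stage_split_sizes` is a hypothetical helper:

```python
import math

def two_stage_split_sizes(n, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for a hold-out split followed by a validation split."""
    n_test = math.ceil(n * test_frac)           # stage 1: hold out the test set
    rest = n - n_test
    n_val = math.ceil(rest * val_frac_of_rest)  # stage 2: validation from the remainder
    return rest - n_val, n_val, n_test

train, val, test = two_stage_split_sizes(74759, 0.15, 0.176)
print(f"Train: {train} | Val: {val} | Test: {test}")
print(f"Validation share of the full dataset: {val / 74759:.4f}")
```

Taking 17.6% of the remaining 85% gives a validation set of about 15% of the full dataset, which is the purpose of the 0.15 / 0.85 ≈ 0.176 factor.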

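The chunked inference loop walks the text list in steps of 128. A minimal sketch of the slice boundaries it produces for this dataset size; `chunk_bounds` is a hypothetical helper that mirrors `range(0, len(input_texts), batch_size)`:

```python
def chunk_bounds(n_items, batch_size):
    """Yield (start, end) index pairs covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

bounds = list(chunk_bounds(74759, 128))
last_start, last_end = bounds[-1]

print(f"Number of chunks: {len(bounds)}")
print(f"Last chunk covers rows {last_start}..{last_end - 1} ({last_end - last_start} items)")
```

Python slicing already clips the final slice, so the loop handles the short last chunk without special treatment.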