---

#**Guide Code Explanation**

---


1. This code imports the libraries needed to fine-tune a DistilBERT model on a text classification task. The torch library provides tensor computation and GPU acceleration for deep learning. load_dataset from datasets downloads and manages datasets. DistilBertForSequenceClassification and DistilBertTokenizerFast from transformers provide the pre-trained DistilBERT model and its matching tokenizer. TrainingArguments and Trainer handle training configuration and execution, and pipeline is imported for the inference step at the end. numpy supports numerical operations, while accuracy_score and f1_score from sklearn.metrics measure prediction accuracy and the balance between precision and recall.

In [None]:
# ==============================================
# 1. IMPORT LIBRARIES
# ==============================================
import torch
from datasets import load_dataset
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

2. This code checks for GPU availability to optimize model training performance. The torch.cuda.is_available() function verifies if a CUDA-enabled GPU is accessible; if true, torch.device("cuda") assigns computations to the GPU and prints the device name using torch.cuda.get_device_name(0). Otherwise, it defaults to the CPU with torch.device("cpu") and displays a fallback message. The next line prints a header indicating the start of data loading and preprocessing. Finally, load_dataset("sst2") loads the Stanford Sentiment Treebank (SST-2) dataset, a benchmark dataset used for binary sentiment classification (positive vs. negative).

In [None]:
# ==============================================
# 2. DEVICE SETUP (GPU / CPU)
# ==============================================
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")
dataset = load_dataset("sst2")

Using GPU: Tesla T4

--- Loading and Preprocessing Data ---


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

3. This code extracts smaller subsets from the SST-2 dataset for faster experimentation. dataset["train"] and dataset["validation"] access the training and validation splits, respectively. The .select(range(2000)) method retrieves the first 2,000 samples from the training data, while .select(range(500)) takes the first 500 of the 872 validation samples. Because the samples are taken in dataset order rather than at random, the subsets are not guaranteed to match the label balance of the full splits. The subsets reduce computation time and resource usage during training and evaluation, which makes them well suited to testing and debugging before running on the full dataset.

In [None]:
# Use a smaller subset to simulate an undergraduate project scale
train_data = dataset["train"].select(range(2000))
eval_data = dataset["validation"].select(range(500))

4. This code initializes the tokenizer for the DistilBERT model. The variable MODEL_NAME stores the identifier "distilbert-base-uncased", which refers to a pre-trained DistilBERT checkpoint trained on lowercased text; "uncased" means its tokenizer lowercases input before tokenizing. DistilBertTokenizerFast.from_pretrained(MODEL_NAME) loads the corresponding fast tokenizer from the Hugging Face model hub. The tokenizer converts raw text into the numerical token IDs the model processes, and it also handles padding, truncation, and the special tokens the DistilBERT architecture expects.

In [None]:
# ==============================================
# 3. TOKENIZER SETUP
# ==============================================
MODEL_NAME = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

5. This code defines a function tokenize_function that preprocesses text data for model input. It takes a batch of dataset examples, reads the "sentence" field, and passes the sentences to the tokenizer for conversion into token IDs understood by DistilBERT. The parameter truncation=True ensures that long sentences are cut to fit the model’s maximum token length, while padding=True pads shorter sentences so that all sequences in the batch passed to the tokenizer have the length of the longest one. The function returns a dictionary containing tokenized representations such as input_ids and attention_mask.

In [None]:
def tokenize_function(examples):
    # Converts text into token IDs
    return tokenizer(examples["sentence"], truncation=True, padding=True)
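
To make the effect of truncation and padding concrete, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: the token IDs, the PAD_ID value, and the helper pad_and_truncate are hypothetical, and a real tokenizer also keeps the closing special token when it truncates.

```python
# Toy illustration only: real tokenizers use subword vocabularies and
# preserve special tokens when truncating. All IDs here are made up.
PAD_ID = 0  # hypothetical padding id for this sketch


def pad_and_truncate(batch_ids, max_length):
    """Truncate each sequence to max_length, then pad to the longest in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model distinguish real tokens from padding, which is why both fields are kept in the later steps.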

6. This code applies the tokenize_function to both training and evaluation subsets to prepare them for model input. The .map() method processes the dataset efficiently by applying the function to each batch of samples, with batched=True enabling faster batch-wise tokenization instead of one-by-one processing. The resulting tokenized_train and tokenized_eval datasets contain additional fields such as input_ids and attention_mask, which are required by the DistilBERT model for text classification tasks.

In [None]:
# Tokenize the data
tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_eval = eval_data.map(tokenize_function, batched=True)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]
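
The batch-wise behavior of .map(..., batched=True) can be sketched without the datasets library. The helper map_batched below is a hypothetical stand-in, not the library's code; it only illustrates that the mapped function receives a dictionary of column name to list of values and returns new columns of the same length. The default batch size of 1000 is an assumption about the library default.

```python
def map_batched(rows, fn, batch_size=1000):
    """Hypothetical stand-in for Dataset.map(fn, batched=True)."""
    columns = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Each call receives a dict of column name -> list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        # Existing columns are kept and the returned columns are added.
        for key, values in {**batch, **new_columns}.items():
            columns.setdefault(key, []).extend(values)
    return columns


def count_words(examples):
    # Works on a whole batch at once, like tokenize_function does.
    return {"n_words": [len(s.split()) for s in examples["sentence"]]}


rows = [
    {"sentence": "a fine film", "label": 1},
    {"sentence": "dull", "label": 0},
    {"sentence": "not bad at all", "label": 1},
]
result = map_batched(rows, count_words, batch_size=2)
print(result["n_words"])
```

One consequence worth noting: with padding=True inside the mapped function, each sequence is padded to the longest sequence of its own map batch, not of the whole dataset.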

7. This code renames the "label" column to "labels" in both the tokenized training and evaluation datasets. Hugging Face sequence classification models receive their targets through an argument named labels, so using that name keeps the dataset columns aligned with the model’s expected input and with the column list selected in the next step. Using .rename_column("label", "labels") lets the Trainer pair each tokenized input with its target label during training and evaluation.

In [None]:
# Rename label column for Hugging Face Trainer
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

8. This code formats the tokenized datasets so they can be used directly with PyTorch models. The .set_format("torch") method converts the specified columns (input_ids, attention_mask, and labels) into PyTorch tensors, which are the required data type for model training and inference. Because the columns are listed explicitly, the dataset exposes only the inputs (token IDs and attention masks) and targets (labels) that the DistilBERT model expects, which keeps it compatible with the Hugging Face Trainer.

In [None]:
# Convert to PyTorch tensors
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

9. This code loads the pre-trained DistilBERT model and prepares it for binary text classification. The function DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2) initializes DistilBERT with two output labels, suitable for tasks like sentiment analysis (positive vs. negative). The .to(device) method moves the model to the appropriate computation device (GPU or CPU) for faster processing. Finally, print(f"Model loaded: {MODEL_NAME}") confirms successful model loading by displaying the model’s name.

In [None]:
# ==============================================
# 4. MODEL DEFINITION
# ==============================================
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)
print(f"Model loaded: {MODEL_NAME}")

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: distilbert-base-uncased


10. This code defines the compute_metrics function, which the Trainer calls at every evaluation pass (here, once per epoch). The function takes p, an object containing the model’s raw output scores (logits) in p.predictions and the true labels in p.label_ids. The line preds = np.argmax(p.predictions, axis=1) selects the class with the highest logit for each example, which is also the class with the highest predicted probability. Then accuracy_score measures how often predictions match the true labels, while f1_score (with average="binary") computes the harmonic mean of precision and recall for the positive class. The function returns these metrics as a dictionary, which lets the Trainer report accuracy and F1-score during evaluation.

In [None]:
# ==============================================
# 5. METRICS AND TRAINING SETUP
# ==============================================
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1}
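
As a worked illustration of what these two metrics measure, the following sketch recomputes accuracy and binary F1 by hand in plain Python. The logits and labels are made-up toy values, and accuracy_and_f1 is a hypothetical helper that mirrors the definitions rather than scikit-learn's implementation.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Toy logits for five examples, two classes each (values are invented).
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
labels = [0, 1, 1, 1, 0]
metrics = accuracy_and_f1(logits, labels)
print(metrics)
```

Here three of five predictions are correct, and the positive class has two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.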

11. This code configures the training setup using Hugging Face’s TrainingArguments class. The output_dir specifies where to save model checkpoints and results, while num_train_epochs=3 sets the model to train for three full passes over the dataset. per_device_train_batch_size and per_device_eval_batch_size define how many samples each GPU or CPU processes per batch during training and evaluation. warmup_steps=500 ramps the learning rate up gradually at the start for stability; note that this run has only 375 optimizer steps in total (2,000 samples at batch size 16 over three epochs), so the warmup phase never completes here. weight_decay=0.01 applies regularization that helps prevent overfitting. Logging is handled by logging_dir and logging_steps, which record training progress every 100 steps. eval_strategy and save_strategy (named evaluation_strategy in older transformers releases) set evaluation and checkpoint saving to occur after each epoch, while load_best_model_at_end=True reloads the best-performing checkpoint when training finishes. Lastly, fp16 enables faster mixed-precision training on GPUs, and report_to=[] disables third-party logging tools like Weights & Biases.

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  # Use 16-bit precision if GPU is available
    report_to=[]  # Disable W&B logging
)
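
The interaction between warmup_steps=500 and the length of this run can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, a linear warmup-then-decay schedule, and a base learning rate of 5e-5 (believed to be the TrainingArguments defaults, but treat both as assumptions). linear_schedule_lr is a hypothetical helper written for this illustration.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero (sketch)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


train_examples = 2000
batch_size = 16
epochs = 3
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

base_lr = 5e-5  # assumption: default learning rate
for step in (0, 125, 250, total_steps):
    lr = linear_schedule_lr(step, base_lr, warmup_steps=500, total_steps=total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")

final_lr = linear_schedule_lr(total_steps, base_lr, 500, total_steps)
```

Under these assumptions the run has 375 optimizer steps, fewer than the 500 warmup steps, so the learning rate is still climbing when training stops and peaks at three quarters of the base value. A smaller warmup_steps value, or a warmup ratio, may suit a subset of this size better.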

12. This code initializes the Hugging Face Trainer, which automates the training and evaluation process. The model parameter passes in the DistilBERT sequence classification model, while args=training_args applies the previously defined training configuration. The train_dataset and eval_dataset parameters supply the tokenized training and validation data, respectively. The compute_metrics function is passed to calculate accuracy and F1-score during evaluation, and the tokenizer is supplied so that batches are padded consistently and the tokenizer is saved alongside the model. Together, these parameters let the Trainer handle optimization, evaluation, logging, and model saving with minimal manual coding.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


13. This code starts the fine-tuning process for the DistilBERT model. The print() statement announces that training is starting and quotes an expected duration of 1–4 hours on a GPU; on this 2,000-sample subset the logged run finished in about 60 seconds on a Tesla T4 (train_runtime of 59.9 s), so the estimate is far too high for this configuration. The trainer.train() command tells the Hugging Face Trainer to begin training with the configuration defined earlier, such as epochs, batch sizes, and evaluation strategy. During this process, the model updates its weights using the training data, is evaluated on the validation set after each epoch, logs performance metrics, and saves a checkpoint after each epoch.

In [None]:
# ==============================================
# 6. EXECUTION - TRAINING
# ==============================================
print("\n--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---")
trainer.train()


--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.6669 | 0.448082 | 0.794 | 0.777538 |
| 2 | 0.3669 | 0.426618 | 0.814 | 0.845258 |
| 3 | 0.2566 | 0.392161 | 0.86 | 0.878893 |


TrainOutput(global_step=375, training_loss=0.3906671905517578, metrics={'train_runtime': 59.8991, 'train_samples_per_second': 100.169, 'train_steps_per_second': 6.261, 'total_flos': 86931730368000.0, 'train_loss': 0.3906671905517578, 'epoch': 3.0})

14. This code performs the final evaluation and saves the fine-tuned model. The print() statement announces the evaluation phase, followed by trainer.evaluate(), which assesses the model’s performance on the validation dataset using the defined compute_metrics function and returns metrics such as accuracy and F1-score in eval_results. Printing eval_results displays these scores. The line trainer.save_model("./sentiment_distilbert_best") saves the model currently held by the trainer to the specified directory for future inference or deployment; because load_best_model_at_end=True, this is the best checkpoint from training (judged by validation loss unless metric_for_best_model is set). The last print statement confirms that fine-tuning has finished and the trained model is ready for sentiment prediction.

In [None]:
# ==============================================
# 7. FINAL EVALUATION
# ==============================================
print("\n--- Final Evaluation Results ---")
eval_results = trainer.evaluate()
print(eval_results)

# Save best model checkpoint for future inference
trainer.save_model("./sentiment_distilbert_best")

print("\nFine-tuning process complete. The resulting model can now be used for inference.")


--- Final Evaluation Results ---


{'eval_loss': 0.3921605348587036, 'eval_accuracy': 0.86, 'eval_f1': 0.8788927335640139, 'eval_runtime': 0.3987, 'eval_samples_per_second': 1254.023, 'eval_steps_per_second': 80.257, 'epoch': 3.0}

Fine-tuning process complete. The resulting model can now be used for inference.


15. The pipeline function simplifies inference by combining the model, tokenizer, and device setup into one interface. The "sentiment-analysis" argument specifies the task type, which enables automatic text preprocessing and output interpretation. The model parameter passes in the fine-tuned model still in memory, while the tokenizer parameter ensures that new text is tokenized the same way as the training data. The device argument assigns computation to the first GPU (0) if one is available and otherwise to the CPU (-1).

In [None]:
# ==============================================
# 8. INFERENCE PIPELINE (TESTING ON NEW DATA)
# ==============================================
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model=model,            # model still in memory
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

Device set to use cuda:0


16. This code runs sentiment analysis on a set of new, unlabeled text samples. The new_data list contains three example sentences expressing different opinions for the model to classify. The print() statement announces the start of inference, and sentiment_analyzer(new_data) uses the previously created Hugging Face sentiment analysis pipeline to process each text input. The pipeline automatically tokenizes the sentences, runs them through the fine-tuned DistilBERT model, and returns a predicted label along with a confidence score for each sentence. Because no label names were configured for the model, the labels are the generic LABEL_0 and LABEL_1 rather than NEGATIVE and POSITIVE. The predictions are stored in the results variable for later display or analysis.

In [None]:
# Example text data (mimicking real-world project reviews)
new_data = [
    "This system is incredibly slow and completely useless for disaster management.",
    "The accuracy is amazing and the new dashboard makes resource allocation simple.",
    "The committee was very critical of the project's limited scope."
]

print("\n--- Running Inference on Unlabeled Data ---")
results = sentiment_analyzer(new_data)


--- Running Inference on Unlabeled Data ---
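
To show where the label and score in each result come from, here is a rough sketch of the post-processing step for a two-class model: a softmax over the logits, followed by picking the most probable class. This is an assumption about the pipeline's behavior for single-label classification, not a copy of its code, and the logits are invented.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Return the top class in the generic LABEL_<index> form with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": f"LABEL_{best}", "score": probs[best]}


example = postprocess([-1.2, 2.1])  # made-up logits for one sentence
print(example["label"], round(example["score"], 4))
```

The score is therefore the probability of the predicted class only, so a confident negative prediction and a confident positive prediction both show values close to 1.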


17. This code displays the sentiment analysis results for each input text. The for loop iterates through new_data and the corresponding results together using zip(). Inside the loop, it checks the model’s predicted label: if result["label"] equals "LABEL_1", the sentiment is reported as Positive; otherwise, it is reported as Negative. The script then prints the original text, its sentiment prediction, and the model’s confidence score formatted to four decimal places. After processing all samples, the final print() statements suggest applying the trained sentiment analyzer to a larger dataset.

In [None]:
# Print results
for text, result in zip(new_data, results):
    sentiment = "Positive" if result["label"] == "LABEL_1" else "Negative"
    print(f"\nText: {text}")
    print(f"Prediction: {sentiment} (Score: {result['score']:.4f})")

print("\n--- Next Steps ---")
print("You may now apply this analyzer to your larger dataset for structured sentiment analysis.")


Text: This system is incredibly slow and completely useless for disaster management.
Prediction: Negative (Score: 0.9610)

Text: The accuracy is amazing and the new dashboard makes resource allocation simple.
Prediction: Positive (Score: 0.9727)

Text: The committee was very critical of the project's limited scope.
Prediction: Negative (Score: 0.8739)

--- Next Steps ---
You may now apply this analyzer to your larger dataset for structured sentiment analysis.
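
The if/else mapping above treats every label other than LABEL_1 as Negative. A slightly more explicit variant is sketched below using a lookup table; LABEL_NAMES and describe are hypothetical names, the mapping assumes the SST-2 convention of 0 for negative and 1 for positive, and the first two scores are copied from the output above while the third entry is invented to show the fallback.

```python
# Assumption: SST-2 encodes negative as 0 and positive as 1.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def describe(result):
    """Format one pipeline-style result, flagging labels that are not mapped."""
    name = LABEL_NAMES.get(result["label"], f"Unknown ({result['label']})")
    return f"{name} (Score: {result['score']:.4f})"


toy_results = [
    {"label": "LABEL_1", "score": 0.9727},
    {"label": "LABEL_0", "score": 0.9610},
    {"label": "LABEL_2", "score": 0.5},  # invented entry to exercise the fallback
]
lines = [describe(r) for r in toy_results]
for line in lines:
    print(line)
```

An alternative, if it fits the workflow, is to set readable names on the model configuration (the id2label and label2id options) when loading the model, so the pipeline returns them directly.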


In [None]:
# ==============================================
# 1. IMPORT LIBRARIES
# ==============================================
import torch
from datasets import load_dataset
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ==============================================
# 2. DEVICE SETUP (GPU / CPU)
# ==============================================
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")
dataset = load_dataset("sst2")

# Use a smaller subset to simulate an undergraduate project scale
train_data = dataset["train"].select(range(2000))
eval_data = dataset["validation"].select(range(500))

# ==============================================
# 3. TOKENIZER SETUP
# ==============================================
MODEL_NAME = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    # Converts text into token IDs
    return tokenizer(examples["sentence"], truncation=True, padding=True)

# Tokenize the data
tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_eval = eval_data.map(tokenize_function, batched=True)

# Rename label column for Hugging Face Trainer
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

# Convert to PyTorch tensors
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# ==============================================
# 4. MODEL DEFINITION
# ==============================================
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)
print(f"Model loaded: {MODEL_NAME}")

# ==============================================
# 5. METRICS AND TRAINING SETUP
# ==============================================
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    eval_strategy="epoch", # Changed from evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  # Use 16-bit precision if GPU is available
    report_to=[]  # Disable W&B logging
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# ==============================================
# 6. EXECUTION - TRAINING
# ==============================================
print("\n--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---")
trainer.train()

# ==============================================
# 7. FINAL EVALUATION
# ==============================================
print("\n--- Final Evaluation Results ---")
eval_results = trainer.evaluate()
print(eval_results)

# Save best model checkpoint for future inference
trainer.save_model("./sentiment_distilbert_best")

print("\nFine-tuning process complete. The resulting model can now be used for inference.")

# ==============================================
# 8. INFERENCE PIPELINE (TESTING ON NEW DATA)
# ==============================================
from transformers import pipeline

# If restarting Colab, reload saved model path
# MODEL_PATH = "./sentiment_distilbert_best"

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model=model,            # model still in memory
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Example text data (mimicking real-world project reviews)
new_data = [
    "This system is incredibly slow and completely useless for disaster management.",
    "The accuracy is amazing and the new dashboard makes resource allocation simple.",
    "The committee was very critical of the project's limited scope."
]

print("\n--- Running Inference on Unlabeled Data ---")
results = sentiment_analyzer(new_data)

# Print results
for text, result in zip(new_data, results):
    sentiment = "Positive" if result["label"] == "LABEL_1" else "Negative"
    print(f"\nText: {text}")
    print(f"Prediction: {sentiment} (Score: {result['score']:.4f})")

print("\n--- Next Steps ---")
print("You may now apply this analyzer to your larger dataset for structured sentiment analysis.")

Using GPU: Tesla T4

--- Loading and Preprocessing Data ---



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: distilbert-base-uncased

--- Starting Fine-Tuning (Expected Time: 1–4 hours on GPU) ---



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.671 | 0.442949 | 0.82 | 0.808511 |
| 2 | 0.3763 | 0.475448 | 0.772 | 0.817308 |
| 3 | 0.2514 | 0.442256 | 0.85 | 0.870912 |



--- Final Evaluation Results ---


{'eval_loss': 0.44225600361824036, 'eval_accuracy': 0.85, 'eval_f1': 0.8709122203098106, 'eval_runtime': 1.0095, 'eval_samples_per_second': 495.288, 'eval_steps_per_second': 31.698, 'epoch': 3.0}

Fine-tuning process complete. The resulting model can now be used for inference.


Device set to use cuda:0



--- Running Inference on Unlabeled Data ---

Text: This system is incredibly slow and completely useless for disaster management.
Prediction: Negative (Score: 0.9818)

Text: The accuracy is amazing and the new dashboard makes resource allocation simple.
Prediction: Positive (Score: 0.9897)

Text: The committee was very critical of the project's limited scope.
Prediction: Negative (Score: 0.9307)

--- Next Steps ---
You may now apply this analyzer to your larger dataset for structured sentiment analysis.


---

#**Experimentation using RoBERTa**

---


#**Best Performing Model Experiment #1**


**Function Description**

This script fine-tunes a RoBERTa sequence classification model for detecting mental-health-related labels in short social posts, then evaluates and runs inference on new texts; it handles device selection (GPU/CPU), reads a CSV dataset, creates Hugging Face Dataset objects, tokenizes the texts, trains with the Trainer API, evaluates the model, saves the best checkpoint, and exposes a simple pipeline-based inference routine that prints predicted labels and scores for example texts.

**Syntax Explanation**

The code uses PyTorch and Hugging Face Transformers / Datasets. RobertaTokenizerFast tokenizes post_text, and RobertaForSequenceClassification loads a pre-trained model (margotwagner/roberta-psychotherapy-eval) that is then moved to device. Dataset.from_pandas converts pandas frames to HF datasets, .map() applies a tokenize_function, and .set_format("torch", ...) readies tensors for training. TrainingArguments configures training, and Trainer orchestrates training and evaluation. The Trainer accepts a compute_metrics callback that computes accuracy and F1, using binary F1 for two labels and weighted F1 for more than two. Finally, pipeline("sentiment-analysis", ...) uses the trained model for quick inference.

**Inputs**

The main input is the CSV file "Mental-Health-Twitter.csv" expected to contain at least post_text and label columns; the script drops rows with missing post_text or label, casts label to integer, and splits the data into an 80/20 train/eval partition. At runtime the script also consumes the example new_data list for demonstrating inference.

**Outputs**

Outputs shown to the user include device selection messages, dataset shapes and label distribution for the validation set, training logs (loss / checkpoints) produced by the Trainer, final evaluation metrics returned by trainer.evaluate() (accuracy and F1 from compute_metrics), the saved best-model directory ./sentiment_roberta_mental_health_best, and printed predictions with label tokens (e.g., LABEL_1) and confidence scores for the sample inference texts.

**Code Flow**

The script checks for CUDA and reports the device, then attempts to load the CSV (exiting if it is missing), drops NaNs, and coerces labels to int. It splits the data into train/eval dataframes, converts them to HF Datasets, defines a tokenizer function, and tokenizes both splits (renaming label→labels and setting torch format). It then determines num_labels, warns if the task is not binary, and loads the RoBERTa model with that num_labels. Next it defines compute_metrics (binary vs. weighted F1), configures TrainingArguments (epochs, batch sizes, fp16 when available, eval/save at epoch), constructs a Trainer, and calls trainer.train(). Finally it evaluates and saves the best model, builds a pipeline("sentiment-analysis", ...) for quick inference on new_data, and prints mapped predictions.

**Comments and Observations**

This is a solid end-to-end fine-tuning pipeline, but a few practical points deserve attention. The script exits on a missing CSV, which is fine for interactive runs but may be abrupt in automated workflows. padding=True in the tokenizer call pads to the longest sequence in each batch, which is acceptable, but consider padding='max_length' with an explicit max_length to stabilize memory use. The code warns correctly when num_labels != 2, but downstream inference and label interpretation (LABEL_1 → Depression) assume a binary mapping that should be made explicit by mapping indices to human-readable labels. Class imbalance, dataset size, and label quality will strongly affect results, so inspect value_counts() more closely and consider stratified splitting. The settings num_train_epochs=8 and warmup_steps=500 may be fine, but monitor for overfitting and evaluate more frequently for faster feedback. Finally, using the sentiment-analysis pipeline with a domain-specific model is convenient, but confirm the model's label order and meaning before deploying or reporting results.
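
The stratified splitting suggested above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the script's method: stratified_split is a hypothetical helper and the 80/20 label list is toy data. In practice, scikit-learn's train_test_split with its stratify argument is the usual route.

```python
import random
from collections import defaultdict


def stratified_split(labels, eval_frac=0.2, seed=42):
    """Split indices so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_eval = round(len(indices) * eval_frac)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Toy imbalanced labels: 80 examples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, eval_idx = stratified_split(labels)
print(len(train_idx), len(eval_idx))
print(sum(labels[i] for i in eval_idx), "positive examples in the eval split")
```

With a plain random 80/20 split, the share of the minority class in the evaluation set can drift from run to run; splitting per label keeps it fixed, which makes F1 scores easier to compare across experiments.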

In [None]:
# ==============================================
# 1. IMPORT LIBRARIES
# ==============================================
import torch
import pandas as pd
from datasets import Dataset
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ==============================================
# 2. DEVICE SETUP (GPU / CPU)
# ==============================================
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")

# ==============================================
# 3. DATA LOADING AND INITIAL PREPROCESSING
# ==============================================
try:
    df = pd.read_csv("Mental-Health-Twitter.csv")
    print(f"Original dataset shape: {df.shape}")
    print(f"Columns in dataset: {df.columns.tolist()}")
except FileNotFoundError:
    print("Error: 'Mental-Health-Twitter.csv' not found. Please ensure the file is in the correct path.")
    exit()

# Filter out rows where 'post_text' or 'label' might be missing
df.dropna(subset=['post_text', 'label'], inplace=True)
print(f"Dataset shape after dropping NaNs: {df.shape}")

# Ensure 'label' column is of integer type for classification
df['label'] = df['label'].astype(int)

# Split 80/20 with an unstratified random sample, then create Hugging Face Dataset objects
train_df = df.sample(frac=0.8, random_state=42) # 80% for training
eval_df = df.drop(train_df.index)               # Remaining 20% for evaluation

train_dataset = Dataset.from_pandas(train_df[['post_text', 'label']])
eval_dataset = Dataset.from_pandas(eval_df[['post_text', 'label']])

print(f"Training dataset size: {len(train_dataset)}")
print(f"Evaluation dataset size: {len(eval_dataset)}")

# ==============================================
# 4. TOKENIZER SETUP
# ==============================================
MODEL_NAME = "margotwagner/roberta-psychotherapy-eval"
tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    # Converts text into token IDs using 'post_text' column
    return tokenizer(examples["post_text"], truncation=True, padding=True)

# Tokenize the data
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

# Rename label column for Hugging Face Trainer
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

# Report the label balance of the evaluation split
print("\nValidation Set Label Distribution (before tokenization):")
print(eval_df['label'].value_counts())
print(f"Total Validation Samples: {len(eval_df)}")

# Convert to PyTorch tensors
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# ==============================================
# 5. MODEL DEFINITION
# ==============================================
# Get the number of unique labels from your dataset to pass to num_labels
num_labels_in_dataset = df['label'].nunique()
if num_labels_in_dataset != 2:
    print(f"Warning: Your dataset has {num_labels_in_dataset} unique labels, but the model is typically for binary classification. Ensure your 'label' column is correctly mapped to 0 and 1.")

model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels_in_dataset).to(device)
print(f"Model loaded: {MODEL_NAME} with {num_labels_in_dataset} labels.")

# ==============================================
# 6. METRICS AND TRAINING SETUP
# ==============================================
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    # Binary F1 (positive class = 1) for two labels; weighted F1 for multi-class
    if num_labels_in_dataset == 2:
        f1 = f1_score(p.label_ids, preds, average="binary")
    else:
        f1 = f1_score(p.label_ids, preds, average="weighted") # Use 'weighted' for multi-class
    acc = accuracy_score(p.label_ids, preds)
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(
    output_dir="./results_roberta_mental_health",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.03,
    logging_dir="./logs_roberta_mental_health",
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# ==============================================
# 7. EXECUTION - TRAINING
# ==============================================
print(f"\n--- Starting Fine-Tuning RoBERTa on {len(train_dataset)} samples ---")
trainer.train()

# ==============================================
# 8. FINAL EVALUATION
# ==============================================
print("\n--- Final Evaluation Results for RoBERTa ---")
eval_results = trainer.evaluate()
print(eval_results)

# Save best model checkpoint for future inference
trainer.save_model("./sentiment_roberta_mental_health_best")

print("\nFine-tuning process complete. The resulting RoBERTa model can now be used for inference.")

# ==============================================
# 9. INFERENCE PIPELINE (TESTING ON NEW DATA)
# ==============================================
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Example text data (mimicking new tweets)
new_data = [
    "I've been feeling really down lately, can't seem to shake this sadness.",
    "Today was a good day, felt a bit more positive and hopeful.",
    "Sometimes it's hard to get out of bed, the weight of everything is just too much.",
    "Connecting with friends really helps lift my spirits."
]

print("\n--- Running Inference on Unlabeled Data with RoBERTa (Testing Inverted Logic) ---")
results = sentiment_analyzer(new_data)

for text, result in zip(new_data, results):
    predicted_status = "Depression" if result["label"] == "LABEL_1" else "No Depression"

    print(f"\nText: {text}")
    print(f"Prediction: {predicted_status} (Score: {result['score']:.4f})")

Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Original dataset shape: (20000, 11)
Columns in dataset: ['Unnamed: 0', 'post_id', 'post_created', 'post_text', 'user_id', 'followers', 'friends', 'favourites', 'statuses', 'retweets', 'label']
Dataset shape after dropping NaNs: (20000, 11)
Training dataset size: 16000
Evaluation dataset size: 4000




Validation Set Label Distribution (before tokenization):
label
0    2044
1    1956
Name: count, dtype: int64
Total Validation Samples: 4000



Model loaded: margotwagner/roberta-psychotherapy-eval with 2 labels.

--- Starting Fine-Tuning RoBERTa on 16000 samples ---




| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.25 | 0.268535 | 0.90575 | 0.907801 |
| 2 | 0.2162 | 0.223305 | 0.924 | 0.924152 |



--- Final Evaluation Results for RoBERTa ---


{'eval_loss': 0.2233051210641861, 'eval_accuracy': 0.924, 'eval_f1': 0.9241516966067864, 'eval_runtime': 4.8469, 'eval_samples_per_second': 825.274, 'eval_steps_per_second': 51.58, 'epoch': 2.0}

Fine-tuning process complete. The resulting RoBERTa model can now be used for inference.


Device set to use cuda:0



--- Running Inference on Unlabeled Data with RoBERTa (Testing Inverted Logic) ---

Text: I've been feeling really down lately, can't seem to shake this sadness.
Prediction: No Depression (Score: 0.9414)

Text: Today was a good day, felt a bit more positive and hopeful.
Prediction: No Depression (Score: 0.9524)

Text: Sometimes it's hard to get out of bed, the weight of everything is just too much.
Prediction: No Depression (Score: 0.5228)

Text: Connecting with friends really helps lift my spirits.
Prediction: No Depression (Score: 0.7377)
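
The inference loop above hard-codes the rule that `LABEL_1` means "Depression". A minimal sketch of making that mapping explicit follows. The dictionary `ID2LABEL` and the helper `to_human_label` are hypothetical names, the mapping 0 = No Depression, 1 = Depression is the notebook's own assumption about the CSV, and the sample results are invented for illustration only.

```python
# Assumed mapping for this dataset: 0 = no depression, 1 = depression.
ID2LABEL = {0: "No Depression", 1: "Depression"}

def to_human_label(result, id2label=ID2LABEL):
    """Translate one pipeline result, e.g. {"label": "LABEL_1", "score": 0.9}."""
    raw = result["label"]
    if raw.startswith("LABEL_"):
        # Default names have the form LABEL_<index>; recover the class index.
        idx = int(raw.split("_", 1)[1])
        return id2label[idx]
    # If the model config already carries readable names, pass them through.
    return raw

# Hypothetical pipeline outputs, not results from the run above.
fake_results = [
    {"label": "LABEL_0", "score": 0.94},
    {"label": "LABEL_1", "score": 0.61},
]
readable = [to_human_label(r) for r in fake_results]
print(readable)
```

The same dictionary (and its inverse) can also be passed as `id2label` and `label2id` when loading the model with `from_pretrained`, so that the pipeline returns readable names directly. Either way, since all four example tweets above are classified as "No Depression", it is worth checking predictions on a few labelled evaluation rows before trusting the mapping.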


#**Best Performing Model Experiment #2**


In this updated configuration, four new hyperparameters are introduced: learning_rate, warmup_ratio, metric_for_best_model, and greater_is_better. The same cell also raises num_train_epochs from 2 to 10, halves per_device_train_batch_size from 16 to 8, and lowers weight_decay from 0.03 to 0.01. The learning_rate (2e-5) controls how much the model’s weights are updated at each step; a small value gives more stable and precise updates and reduces the risk of overshooting during fine-tuning. The warmup_ratio (0.2) means the first 20% of the total training steps are used to increase the learning rate gradually from zero to the target value, which helps the model stabilize early in training and avoid sudden gradient spikes. The metric_for_best_model ("f1") specifies that the F1 score determines which checkpoint is considered best, while greater_is_better (True) tells the Trainer that a higher F1 is better. Together with load_best_model_at_end=True, these settings are intended to improve training stability and to ensure that the checkpoint restored at the end of training is selected on balanced precision-recall performance rather than on validation loss, which is the default criterion.
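
To make the warmup arithmetic concrete, here is a small sketch of a linear warmup followed by linear decay, which is the Trainer's default schedule shape. It assumes a single device, no gradient accumulation, and the 16,000 training samples reported in the run; the function name `linear_schedule_lr` is hypothetical and the code only mirrors the schedule, it does not call the library.

```python
import math

def linear_schedule_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining

# Values taken from the configuration described above.
train_samples = 16000
batch_size = 8
epochs = 10
peak_lr = 2e-5
warmup_ratio = 0.2

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps, peak_lr))
```

Under these assumptions the warmup covers the first two of the ten epochs, so the learning rate only reaches 2e-5 at the end of epoch 2.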

In [None]:
# ==============================================
# 1. IMPORT LIBRARIES
# ==============================================
import torch
import pandas as pd
from datasets import Dataset
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# ==============================================
# 2. DEVICE SETUP (GPU / CPU)
# ==============================================
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

print("\n--- Loading and Preprocessing Data ---")

# ==============================================
# 3. DATA LOADING AND INITIAL PREPROCESSING
# ==============================================
try:
    df = pd.read_csv("Mental-Health-Twitter.csv")
    print(f"Original dataset shape: {df.shape}")
    print(f"Columns in dataset: {df.columns.tolist()}")
except FileNotFoundError:
    print("Error: 'Mental-Health-Twitter.csv' not found. Please ensure the file is in the correct path.")
    exit()

# Filter out rows where 'post_text' or 'label' might be missing
df.dropna(subset=['post_text', 'label'], inplace=True)
print(f"Dataset shape after dropping NaNs: {df.shape}")

# Ensure 'label' column is of integer type for classification
df['label'] = df['label'].astype(int)

# Split 80/20 with an unstratified random sample, then create Hugging Face Dataset objects
train_df = df.sample(frac=0.8, random_state=42) # 80% for training
eval_df = df.drop(train_df.index)               # Remaining 20% for evaluation

train_dataset = Dataset.from_pandas(train_df[['post_text', 'label']])
eval_dataset = Dataset.from_pandas(eval_df[['post_text', 'label']])

print(f"Training dataset size: {len(train_dataset)}")
print(f"Evaluation dataset size: {len(eval_dataset)}")

# ==============================================
# 4. TOKENIZER SETUP
# ==============================================
MODEL_NAME = "margotwagner/roberta-psychotherapy-eval"
tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    # Converts text into token IDs using 'post_text' column
    return tokenizer(examples["post_text"], truncation=True, padding=True)

# Tokenize the data
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

# Rename label column for Hugging Face Trainer
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

# Report the label balance of the evaluation split
print("\nValidation Set Label Distribution (before tokenization):")
print(eval_df['label'].value_counts())
print(f"Total Validation Samples: {len(eval_df)}")

# Convert to PyTorch tensors
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# ==============================================
# 5. MODEL DEFINITION
# ==============================================
# Get the number of unique labels from your dataset to pass to num_labels
num_labels_in_dataset = df['label'].nunique()
if num_labels_in_dataset != 2:
    print(f"Warning: Your dataset has {num_labels_in_dataset} unique labels, but the model is typically for binary classification. Ensure your 'label' column is correctly mapped to 0 and 1.")

model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels_in_dataset).to(device)
print(f"Model loaded: {MODEL_NAME} with {num_labels_in_dataset} labels.")

# ==============================================
# 6. METRICS AND TRAINING SETUP
# ==============================================
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    # Binary F1 (positive class = 1) for two labels; weighted F1 for multi-class
    if num_labels_in_dataset == 2:
        f1 = f1_score(p.label_ids, preds, average="binary")
    else:
        f1 = f1_score(p.label_ids, preds, average="weighted") # Use 'weighted' for multi-class
    acc = accuracy_score(p.label_ids, preds)
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(
    output_dir="./results_roberta_mental_health",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.2,
    weight_decay=0.01,
    logging_dir="./logs_roberta_mental_health",
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    report_to=["none"],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# ==============================================
# 7. EXECUTION - TRAINING
# ==============================================
print(f"\n--- Starting Fine-Tuning RoBERTa on {len(train_dataset)} samples ---")
trainer.train()

# ==============================================
# 8. FINAL EVALUATION
# ==============================================
print("\n--- Final Evaluation Results for RoBERTa ---")
eval_results = trainer.evaluate()
print(eval_results)

# Save best model checkpoint for future inference
trainer.save_model("./sentiment_roberta_mental_health_best")

print("\nFine-tuning process complete. The resulting RoBERTa model can now be used for inference.")

# ==============================================
# 9. INFERENCE PIPELINE (TESTING ON NEW DATA)
# ==============================================
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Example text data (mimicking new tweets)
new_data = [
    "I've been feeling really down lately, can't seem to shake this sadness.",
    "Today was a good day, felt a bit more positive and hopeful.",
    "Sometimes it's hard to get out of bed, the weight of everything is just too much.",
    "Connecting with friends really helps lift my spirits."
]

print("\n--- Running Inference on Unlabeled Data with RoBERTa (Testing Inverted Logic) ---")
results = sentiment_analyzer(new_data)

for text, result in zip(new_data, results):
    predicted_status = "Depression" if result["label"] == "LABEL_1" else "No Depression"

    print(f"\nText: {text}")
    print(f"Prediction: {predicted_status} (Score: {result['score']:.4f})")

Using GPU: Tesla T4

--- Loading and Preprocessing Data ---
Original dataset shape: (20000, 11)
Columns in dataset: ['Unnamed: 0', 'post_id', 'post_created', 'post_text', 'user_id', 'followers', 'friends', 'favourites', 'statuses', 'retweets', 'label']
Dataset shape after dropping NaNs: (20000, 11)
Training dataset size: 16000
Evaluation dataset size: 4000




Validation Set Label Distribution (before tokenization):
label
0    2044
1    1956
Name: count, dtype: int64
Total Validation Samples: 4000
Model loaded: margotwagner/roberta-psychotherapy-eval with 2 labels.

--- Starting Fine-Tuning RoBERTa on 16000 samples ---




| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2582 | 0.330939 | 0.8775 | 0.87817 |
| 2 | 0.4083 | 0.256082 | 0.904 | 0.906296 |
| 3 | 0.2263 | 0.374609 | 0.9165 | 0.913873 |
| 4 | 0.1874 | 0.369773 | 0.925 | 0.922561 |
| 5 | 0.1258 | 0.456512 | 0.927 | 0.926225 |
| 6 | 0.0826 | 0.507889 | 0.92125 | 0.917647 |
| 7 | 0.0319 | 0.568827 | 0.93 | 0.928862 |
| 8 | 0.0353 | 0.532781 | 0.93325 | 0.932216 |
| 9 | 0.0505 | 0.671286 | 0.9295 | 0.929323 |
| 10 | 0.0 | 0.635421 | 0.93475 | 0.934141 |



--- Final Evaluation Results for RoBERTa ---


{'eval_loss': 0.6354213356971741, 'eval_accuracy': 0.93475, 'eval_f1': 0.9341408024224073, 'eval_runtime': 4.7652, 'eval_samples_per_second': 839.418, 'eval_steps_per_second': 52.464, 'epoch': 10.0}


Device set to use cuda:0



Fine-tuning process complete. The resulting RoBERTa model can now be used for inference.

--- Running Inference on Unlabeled Data with RoBERTa (Testing Inverted Logic) ---

Text: I've been feeling really down lately, can't seem to shake this sadness.
Prediction: No Depression (Score: 1.0000)

Text: Today was a good day, felt a bit more positive and hopeful.
Prediction: No Depression (Score: 1.0000)

Text: Sometimes it's hard to get out of bed, the weight of everything is just too much.
Prediction: No Depression (Score: 1.0000)

Text: Connecting with friends really helps lift my spirits.
Prediction: No Depression (Score: 1.0000)
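
The per-epoch log for this run can be inspected to see which epoch each selection criterion would pick. The sketch below copies the numbers from the training log above into a plain list; the variable names are hypothetical and nothing here calls the Trainer.

```python
# (epoch, training_loss, validation_loss, accuracy, f1), copied from the log above
log = [
    (1, 0.2582, 0.330939, 0.8775, 0.87817),
    (2, 0.4083, 0.256082, 0.904, 0.906296),
    (3, 0.2263, 0.374609, 0.9165, 0.913873),
    (4, 0.1874, 0.369773, 0.925, 0.922561),
    (5, 0.1258, 0.456512, 0.927, 0.926225),
    (6, 0.0826, 0.507889, 0.92125, 0.917647),
    (7, 0.0319, 0.568827, 0.93, 0.928862),
    (8, 0.0353, 0.532781, 0.93325, 0.932216),
    (9, 0.0505, 0.671286, 0.9295, 0.929323),
    (10, 0.0, 0.635421, 0.93475, 0.934141),
]

# Epoch chosen when metric_for_best_model="f1" and greater_is_better=True
best_by_f1 = max(log, key=lambda row: row[4])
# Epoch that would be chosen if validation loss were the criterion
best_by_val_loss = min(log, key=lambda row: row[2])

print(f"Best F1: epoch {best_by_f1[0]} (F1 {best_by_f1[4]}, val loss {best_by_f1[2]})")
print(f"Lowest val loss: epoch {best_by_val_loss[0]} "
      f"(F1 {best_by_val_loss[4]}, val loss {best_by_val_loss[2]})")
```

The two criteria disagree: F1 keeps improving while validation loss rises after epoch 2 and training loss approaches zero, a pattern that usually indicates overfitting and increasingly overconfident predictions. That would be consistent with the 1.0000 scores in the inference output, though the log alone does not prove it. Stopping earlier, for example with the `EarlyStoppingCallback` from transformers, is one option to consider.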
