# Drug Review Classification with BioBERT

This notebook demonstrates a text classification pipeline using the Drugs.com dataset. The goal is to predict the medical condition associated with a drug based on its name.

The workflow includes:

1. Loading and splitting the dataset into training, validation, and test sets.  
2. Filtering for the top 6 most frequent conditions and encoding labels for classification.  
3. Tokenizing drug names using BioBERT (`dmis-lab/biobert-base-cased-v1.1`) and preparing the data for training.  
4. Fine-tuning a BioBERT-based classifier with Hugging Face `Trainer`.  
5. Evaluating model performance using accuracy, precision, recall, and F1 score.  
6. Running predictions on the test set and mapping predicted labels back to conditions.

This notebook focuses on a clean, end-to-end pipeline for sequence classification while demonstrating reproducible preprocessing, model training, evaluation, and inference.


In [1]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


We load the two TSV files with the `load_dataset` function from the Hugging Face `datasets` library, passing a tab as the delimiter.

In [2]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In order to get a quick feel for the type of data we're working with, we create a random sample and print the first few examples.

In [3]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

The `Unnamed: 0` column looks like an anonymized patient ID.

To test this hypothesis, we can use the `Dataset.unique()` function to verify that the number of distinct IDs matches the number of rows in each split:

In [4]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
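
The name `Unnamed: 0` is what pandas-based CSV readers typically assign to a column whose header cell is empty. The sketch below uses only the standard library and a made-up two-row TSV (the rows and IDs are hypothetical, and the empty first header cell is an assumption about how the raw files are laid out) to show such a header and the same uniqueness check in plain Python.

```python
import csv
import io

# Hypothetical miniature TSV whose header starts with an empty cell.
# This layout is an assumption, not copied from the real files.
raw = (
    "\tdrugName\tcondition\n"
    "101\tNaproxen\tGout, Acute\n"
    "102\tMobic\tInflammatory Conditions\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, records = rows[0], rows[1:]

# The first header cell is empty, so a reader has to invent a name for it.
first_column_name = header[0]

# Same idea as the Dataset.unique() check: the IDs are unique when the
# number of distinct values equals the number of rows.
ids = [record[0] for record in records]
ids_are_unique = len(set(ids)) == len(ids)

print(repr(first_column_name), ids_are_unique)
```

Note that uniqueness alone only shows that the column can serve as a row identifier; it does not prove that each value corresponds to a distinct patient.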

The assertion passes, which is consistent with our hypothesis, so let’s clean up the dataset a bit by renaming the `Unnamed: 0` column to `patient_id`.

In [5]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

Next, we split our training set into train and validation splits.

In [6]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 129037
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 32260
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
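
The split sizes above can be checked by hand. The following sketch assumes scikit-learn-style rounding (train count rounded down, held-out count rounded up); that rounding rule is an assumption about the splitter, but it reproduces the row counts shown in the output.

```python
import math

n_rows = 161297        # size of the original training split, from the output above
train_fraction = 0.8   # value passed as train_size

# Assumption: the train count is rounded down and the held-out count is
# rounded up, so that the two parts always add up to n_rows.
n_train = math.floor(train_fraction * n_rows)
n_validation = math.ceil((1 - train_fraction) * n_rows)

print(n_train, n_validation, n_train + n_validation)
```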

The dataset now has train, validation, and test splits.

Next, we keep only the columns we are interested in: `drugName` and `condition`.

In [7]:
# Keep only 'drugName' (model input) and 'condition' (target); drop every other column
columns_to_keep = ["drugName", "condition"]
drug_dataset_filtered = drug_dataset_clean.remove_columns(
    [col for col in drug_dataset_clean["train"].column_names if col not in columns_to_keep]
)

print(drug_dataset_filtered)

DatasetDict({
    train: Dataset({
        features: ['drugName', 'condition'],
        num_rows: 129037
    })
    validation: Dataset({
        features: ['drugName', 'condition'],
        num_rows: 32260
    })
    test: Dataset({
        features: ['drugName', 'condition'],
        num_rows: 53766
    })
})


We prepare the dataset for model training:

1. **Tokenizer initialization**: Loads the `dmis-lab/biobert-base-cased-v1.1` tokenizer to convert drug names into token IDs suitable for BioBERT.  
2. **Top condition selection**: Identifies the 6 most frequent medical conditions in the training set and filters the dataset to keep only these conditions.  
3. **Label encoding**: Uses `LabelEncoder` to convert condition names into numerical labels for classification.  
4. **Tokenization**: Tokenizes the `drugName` column for all examples.  
5. **Data collation**: Sets up a `DataCollatorWithPadding` to automatically pad tokenized inputs during training.  

This prepares both the inputs and labels in the format required by the Hugging Face `Trainer`.
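
Steps 2 and 3 can be illustrated on toy data without any external library. This is a minimal sketch, not the notebook's code: the label list is hypothetical, `k` is 2 instead of 6, and the sorted-name numbering mimics how `LabelEncoder` assigns integer ids (class names in sorted order, starting at 0).

```python
from collections import Counter

# Hypothetical toy labels; the notebook uses the 'condition' column instead.
conditions = ["Pain", "Acne", "Pain", "Anxiety", "Pain", "Acne", "Insomnia"]

# Keep only the k most frequent classes (k=2 here, 6 in the notebook).
top = {name for name, _ in Counter(conditions).most_common(2)}
kept = [c for c in conditions if c in top]

# Mimic LabelEncoder: sort the distinct class names and number them from 0.
classes = sorted(set(kept))
to_id = {name: i for i, name in enumerate(classes)}
labels = [to_id[c] for c in kept]

# Inverse mapping, analogous to le.inverse_transform.
decoded = [classes[i] for i in labels]

print(classes, labels)
```

Because the ids follow alphabetical order rather than frequency, the mapping has to be kept (here `classes`, in the notebook the fitted `le`) to turn predictions back into condition names.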


In [8]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from sklearn.preprocessing import LabelEncoder
from collections import Counter

checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

train_conditions = drug_dataset_filtered["train"]["condition"]
top_conditions = [cond for cond, count in Counter(train_conditions).most_common(6)]

drug_dataset_filtered = drug_dataset_filtered.filter(lambda x: x["condition"] in top_conditions)

le = LabelEncoder()
le.fit(drug_dataset_filtered["train"]["condition"])
drug_dataset_filtered = drug_dataset_filtered.map(lambda x: {'labels': le.transform(x['condition'])}, batched=True)


def tokenize_function(example):
    return tokenizer(example["drugName"], truncation=True)

tokenized_datasets = drug_dataset_filtered.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Rows after filtering: train 47796 (of 129037), validation 11922 (of 32260), test 19978 (of 53766)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
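
The idea behind padding at collation time is that each batch is padded only to the length of its own longest sequence. The sketch below shows that idea in plain Python; `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is assumed.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three drug names of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]
padded = pad_batch(batch)

for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Drug names are short, so padding per batch rather than to a fixed global length keeps the tensors small.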

We define the training arguments (evaluation after every epoch, mixed-precision training, two epochs) and load BioBERT with a sequence classification head for our six labels.

In [9]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer",
                                  eval_strategy="epoch",
                                  fp16=True,
                                  num_train_epochs=2
                                  )

In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

We define a function that computes evaluation metrics during training and evaluation, using the `accuracy` metric from the Hugging Face `evaluate` library.

In [14]:
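import numpy as np
import evaluate

# Load the metric once instead of reloading it on every evaluation call
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    # eval_preds unpacks into the model logits and the reference labels
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)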
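
To make the metric computation concrete, here is a dependency-free sketch of the same logic on made-up logits. The helper names and the numbers are hypothetical; the notebook itself relies on `np.argmax` and the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four examples and three classes.
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
labels = [0, 2, 1, 1]

result = accuracy_from_logits(logits, labels)
print(result)
```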


We initialize the Hugging Face Trainer object.

In [15]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


We initialize a Weights & Biases (wandb) run for experiment tracking under the project "drug-reviews-classification", then start training the model using the Trainer.


In [16]:
import wandb

wandb.init(project="drug-reviews-classification")

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.248 | 0.234761 | 0.917296 |
| 2 | 0.2039 | 0.216394 | 0.918722 |


TrainOutput(global_step=11950, training_loss=0.26433155841907197, metrics={'train_runtime': 1383.2659, 'train_samples_per_second': 69.106, 'train_steps_per_second': 8.639, 'total_flos': 620076359699712.0, 'train_loss': 0.26433155841907197, 'epoch': 2.0})
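
The reported `global_step` can be reproduced with a short calculation. This sketch assumes a batch size of 8 (the default `per_device_train_batch_size`), a single device, and no gradient accumulation; none of these are set explicitly in the notebook, so treat them as assumptions.

```python
import math

n_train_examples = 47796  # rows in the filtered training split
batch_size = 8            # assumed: library default, one device, no accumulation
num_epochs = 2            # num_train_epochs from TrainingArguments

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the `global_step=11950` shown in `TrainOutput`.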

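We generate predictions on the validation set and compute accuracy together with macro-averaged precision, recall, and F1 score to assess model performance. Macro averaging gives every condition equal weight regardless of how many examples it has.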

In [17]:
val_dataset = tokenized_datasets["validation"]

predictions_output = trainer.predict(val_dataset)
logits = predictions_output.predictions
labels = predictions_output.label_ids

predictions = np.argmax(logits, axis=-1)

In [18]:
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

precision = precision_metric.compute(predictions=predictions, references=labels, average="macro")
recall = recall_metric.compute(predictions=predictions, references=labels, average="macro")
f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")
accuracy = accuracy_metric.compute(predictions=predictions, references=labels)

print(f"Accuracy: {accuracy['accuracy']:.4f}")
print(f"Precision: {precision['precision']:.4f}")
print(f"Recall: {recall['recall']:.4f}")
print(f"F1 Score: {f1['f1']:.4f}")

Accuracy: 0.9187
Precision: 0.8970
Recall: 0.8780
F1 Score: 0.8830
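
Macro-averaged scores sitting below accuracy is what one would expect when some classes are rarer and harder than others. The sketch below computes macro precision, recall, and F1 by hand on a deliberately imbalanced toy example; `macro_scores` is a hypothetical helper and the labels are made up, so the numbers only illustrate the effect and say nothing about this model's per-class behavior.

```python
def macro_scores(labels, predictions):
    """Unweighted mean of per-class precision, recall, and F1."""
    classes = sorted(set(labels) | set(predictions))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: eight examples of class 0, two of class 1, one of which is missed.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
macro_precision, macro_recall, macro_f1 = macro_scores(labels, predictions)

print(f"accuracy={accuracy:.4f} precision={macro_precision:.4f} "
      f"recall={macro_recall:.4f} f1={macro_f1:.4f}")
```

A single mistake on the rare class costs only one tenth of the accuracy but halves that class's recall, which pulls the macro averages down.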


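We tokenize the `drugName` field from the test set using the same tokenizer as during training. (`tokenized_datasets["test"]` from the preparation step already holds the same tokenization and could be reused instead.)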

In [20]:
tokenized_test = drug_dataset_filtered["test"].map(
    lambda x: tokenizer(x["drugName"], truncation=True),
    batched=True,
)


We use the trained model to make predictions on the tokenized test set.

In [21]:
predictions_output = trainer.predict(tokenized_test)
logits = predictions_output.predictions


We convert raw logits to predicted class indices by selecting the highest logit for each example, then map the indices back to condition names with the fitted `LabelEncoder`.

In [22]:
predicted_class_ids = np.argmax(logits, axis=-1)

In [23]:
predicted_conditions = le.inverse_transform(predicted_class_ids)
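
Taking the argmax of the logits directly is enough for picking a class, because softmax preserves the ordering of its inputs. The sketch below shows this on made-up logits using only the standard library; it is an illustration of the principle, and the notebook itself never computes probabilities.

```python
import math


def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one example and three classes.
logits = [1.2, -0.4, 3.1]
probs = softmax(logits)

# The winning index is the same whether we look at logits or probabilities.
predicted_from_logits = max(range(len(logits)), key=logits.__getitem__)
predicted_from_probs = max(range(len(probs)), key=probs.__getitem__)

print(predicted_from_logits, predicted_from_probs)
```

Softmax is only needed when a confidence score per condition is wanted in addition to the predicted label.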

We display the first 5 predictions from the test set.

In [24]:
# Show the first five drug names alongside their predicted conditions
# (zip stops as soon as the five-element slice is exhausted)
for drug, condition in zip(tokenized_test["drugName"], predicted_conditions[:5]):
    print(f"Drug: {drug} -> Predicted condition: {condition}")

Drug: Mirtazapine -> Predicted condition: Anxiety
Drug: Cyclafem 1 / 35 -> Predicted condition: Birth Control
Drug: Copper -> Predicted condition: Birth Control
Drug: Levora -> Predicted condition: Birth Control
Drug: Microgestin Fe 1 / 20 -> Predicted condition: Birth Control
