**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Because fine-tuning requires more computational resources, this part is optional. If you decide to complete it, you will need a GPU: for reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error. We therefore suggest Google Colab or another cloud-based service with a GPU. You can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [1]:
# imports for the project

from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [3]:
import torch

# Check whether MPS (Mac GPU) is available
print("MPS available:", torch.backends.mps.is_available())

# Check whether PyTorch was built with MPS support
print("MPS build:", torch.backends.mps.is_built())


MPS available: True
MPS build: True


### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/), which returns it as a `Dataset` object. Pretty neat!

**Note**: This cell downloads the dataset the first time it runs. `load_dataset` caches downloads on disk by default, so later runs should reuse the cached copy instead of downloading it again.

You are welcome to increase `TRAIN_SIZE` and `TEST_SIZE` to load more data.

In [4]:
TRAIN_SIZE = 1 # percent as whole number
TEST_SIZE = 10 # percent as whole number

In [5]:
ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]",)  # n% of training data
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]")  # n% of test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 760
    })
})
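
The row counts above follow from the split percentages. As a quick sanity check, here is a minimal sketch of that arithmetic, assuming the full AG News splits contain 120,000 training and 7,600 test examples. `rows_for_percent` is a hypothetical helper for illustration, not part of the `datasets` API, and the rounding that `datasets` applies to percent slices may differ when the percentage does not divide the split evenly.

```python
# Full split sizes of AG News (an assumption stated in the text above).
FULL_SPLIT_SIZES = {"train": 120_000, "test": 7_600}


def rows_for_percent(split: str, percent: int) -> int:
    """Hypothetical helper: rows selected by a slice like `split[:percent%]`."""
    return FULL_SPLIT_SIZES[split] * percent // 100


train_rows = rows_for_percent("train", 1)   # TRAIN_SIZE = 1
test_rows = rows_for_percent("test", 10)    # TEST_SIZE = 10
print(train_rows, test_rows)
```

Raising `TRAIN_SIZE` to 5, for example, would select 6,000 training rows under the same assumption.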

### 3.1. Load ModernBERT pipeline

We can make use of the `pipeline` function from the Hugging Face `Transformers` library to load both the model and the tokenizer for ModernBERT. The pipeline automatically tokenizes the input text, passes it through the model, and returns the resulting token embeddings. We request this behaviour by setting `task="feature-extraction"`.

In [6]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use mps:0


### 3.2. Encode the data

In this step, we convert each text in our dataset into a numerical representation that machine learning models can work with. To do this, we use the `ModernBERT` model, which takes in text and produces embeddings: vectors of numbers that capture the meaning of the text.

In [7]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=32)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

Map:   0%|          | 0/760 [00:00<?, ? examples/s]
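
The indexing `e[0][0]` in `get_embeddings` can be easier to follow with explicit shapes. The sketch below uses random numbers in place of real model output and assumes that, for a list of texts, the feature-extraction pipeline returns one nested list per text shaped `[1, num_tokens, hidden_size]`. `fake_output` and the token counts are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768  # matches the embedding width used in this notebook

# Stand-in for embedder(["text a", "text b"]): one entry per text,
# each shaped [1, num_tokens, hidden_size]; token counts differ per text.
fake_output = [
    rng.normal(size=(1, 12, hidden_size)).tolist(),
    rng.normal(size=(1, 7, hidden_size)).tolist(),
]

# e[0] drops the leading batch axis; e[0][0] is the vector of the first token.
cls_embeddings = [e[0][0] for e in fake_output]

cls_matrix = np.array(cls_embeddings)
print(cls_matrix.shape)  # (2, 768)
```

Although the two texts have different numbers of tokens, taking only the first token gives every text a vector of the same length, which is what the classifier needs.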

In [None]:
ag_news.save_to_disk("test.hf")

Saving the dataset (0/1 shards):   0%|          | 0/1200 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/760 [00:00<?, ? examples/s]

Then, we can extract features and labels into `X_train`, `y_train`, `X_test`, and `y_test` to fit the standard scikit-learn paradigm.

In [10]:

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (1200, 768), y_train shape: (1200,)
X_test shape: (760, 768), y_test shape: (760,)
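
Because `train[:1%]` takes the first rows of the split rather than a random sample, it can be worth checking how the labels are distributed. The sketch below is one way to do that. `label_share` is a hypothetical helper, and the toy labels are rebuilt from the per-class support values in the training classification report (273, 182, 202, 543) rather than loaded from the dataset.

```python
import numpy as np


def label_share(labels: np.ndarray, num_classes: int = 4) -> np.ndarray:
    """Return the fraction of examples belonging to each class label."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()


# Toy labels rebuilt from the support column of the training report;
# in the notebook you would pass y_train instead.
toy_y_train = np.repeat([0, 1, 2, 3], [273, 182, 202, 543])

shares = label_share(toy_y_train)
for label, share in enumerate(shares):
    print(f"class {label}: {share:.1%}")
```

With these counts, class 3 makes up roughly 45% of the training subset, while the test subset (support 197, 199, 158, 206) is closer to balanced. This imbalance may contribute to the weaker test results for some classes.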


### Hyperparameter tuning
We run a grid search over the following Logistic Regression hyperparameters, using 3-fold cross-validation on the training set:

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],  # Regularization strength
    "solver": ["liblinear", "lbfgs"],  # Different solvers
    "max_iter": [500, 1000]  # Increase iterations for convergence
}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy", verbose=2)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")


Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] END .............C=0.01, max_iter=500, solver=liblinear; total time=   0.3s
[CV] END .............C=0.01, max_iter=500, solver=liblinear; total time=   0.4s
[CV] END .............C=0.01, max_iter=500, solver=liblinear; total time=   0.5s
[CV] END .................C=0.01, max_iter=500, solver=lbfgs; total time=   0.8s
[CV] END .................C=0.01, max_iter=500, solver=lbfgs; total time=   0.6s
[CV] END .................C=0.01, max_iter=500, solver=lbfgs; total time=   0.7s
[CV] END ............C=0.01, max_iter=1000, solver=liblinear; total time=   0.3s
[CV] END ............C=0.01, max_iter=1000, solver=liblinear; total time=   0.3s
[CV] END ............C=0.01, max_iter=1000, solver=liblinear; total time=   0.3s
[CV] END ................C=0.01, max_iter=1000, solver=lbfgs; total time=   0.5s
[CV] END ................C=0.01, max_iter=1000, solver=lbfgs; total time=   0.6s
[CV] END ................C=0.01, max_iter=1000, 

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV] END ..................C=0.1, max_iter=500, solver=lbfgs; total time=   1.5s
[CV] END ..................C=0.1, max_iter=500, solver=lbfgs; total time=   1.0s
[CV] END ..................C=0.1, max_iter=500, solver=lbfgs; total time=   2.0s
[CV] END .............C=0.1, max_iter=1000, solver=liblinear; total time=   0.5s
[CV] END .............C=0.1, max_iter=1000, solver=liblinear; total time=   0.5s
[CV] END .............C=0.1, max_iter=1000, solver=liblinear; total time=   0.5s
[CV] END .................C=0.1, max_iter=1000, solver=lbfgs; total time=   1.6s
[CV] END .................C=0.1, max_iter=1000, solver=lbfgs; total time=   0.5s
[CV] END .................C=0.1, max_iter=1000, solver=lbfgs; total time=   0.5s
[CV] END ................C=1, max_iter=500, solver=liblinear; total time=   0.8s
[CV] END ................C=1, max_iter=500, solver=liblinear; total time=   0.7s
[CV] END ................C=1, max_iter=500, solver=liblinear; total time=   0.7s
[CV] END ...................



[CV] END ....................C=1, max_iter=500, solver=lbfgs; total time=   1.5s
[CV] END ...............C=1, max_iter=1000, solver=liblinear; total time=   0.8s
[CV] END ...............C=1, max_iter=1000, solver=liblinear; total time=   0.7s
[CV] END ...............C=1, max_iter=1000, solver=liblinear; total time=   0.7s
[CV] END ...................C=1, max_iter=1000, solver=lbfgs; total time=   1.2s
[CV] END ...................C=1, max_iter=1000, solver=lbfgs; total time=   2.6s
[CV] END ...................C=1, max_iter=1000, solver=lbfgs; total time=   1.4s
[CV] END ...............C=10, max_iter=500, solver=liblinear; total time=   0.9s
[CV] END ...............C=10, max_iter=500, solver=liblinear; total time=   1.0s
[CV] END ...............C=10, max_iter=500, solver=liblinear; total time=   1.0s




[CV] END ...................C=10, max_iter=500, solver=lbfgs; total time=   1.0s




[CV] END ...................C=10, max_iter=500, solver=lbfgs; total time=   1.5s




[CV] END ...................C=10, max_iter=500, solver=lbfgs; total time=   1.3s
[CV] END ..............C=10, max_iter=1000, solver=liblinear; total time=   0.9s
[CV] END ..............C=10, max_iter=1000, solver=liblinear; total time=   1.0s
[CV] END ..............C=10, max_iter=1000, solver=liblinear; total time=   1.0s
[CV] END ..................C=10, max_iter=1000, solver=lbfgs; total time=   1.0s
[CV] END ..................C=10, max_iter=1000, solver=lbfgs; total time=   1.0s
[CV] END ..................C=10, max_iter=1000, solver=lbfgs; total time=   1.5s
Best params: {'C': 1, 'max_iter': 500, 'solver': 'liblinear'}
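
The log line "Fitting 3 folds for each of 16 candidates, totalling 48 fits" follows directly from the size of the grid. Here is a short sketch of that arithmetic using only the standard library. It enumerates the same combinations as `param_grid` but does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
    "max_iter": [500, 1000],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 16 48
```

Since `max_iter` only caps the number of optimizer iterations, fixing it at a single sufficiently large value would likely halve the number of fits without removing any distinct model from the search.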


### 3.3. Train a classifier

In [12]:
# Use the best hyperparameters found by the grid search
best_params = grid_search.best_params_
lr = LogisticRegression(**best_params)
lr.fit(X_train, y_train)


In [13]:
y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       273
           1       0.99      1.00      1.00       182
           2       0.99      0.98      0.98       202
           3       0.99      0.99      0.99       543

    accuracy                           0.99      1200
   macro avg       0.99      0.99      0.99      1200
weighted avg       0.99      0.99      0.99      1200



### 3.4. Make predictions

In [14]:
y_pred = lr.predict(X_test)

In [15]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.76      0.76       197
           1       0.91      0.87      0.89       199
           2       0.70      0.60      0.65       158
           3       0.75      0.86      0.81       206

    accuracy                           0.79       760
   macro avg       0.78      0.78      0.78       760
weighted avg       0.79      0.79      0.78       760
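
The `macro avg` and `weighted avg` rows in the report differ only in how the per-class scores are combined. As a small worked example, the sketch below recomputes both F1 averages from the rounded per-class values in the test report above, so the results can deviate slightly from what scikit-learn computes on unrounded scores.

```python
# Per-class F1 and support copied from the test-set report above.
f1_scores = [0.76, 0.89, 0.65, 0.81]
support = [197, 199, 158, 206]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * n for f, n in zip(f1_scores, support)) / sum(support)

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

The two averages are close here because the test classes have similar support. They would diverge more on a strongly imbalanced test set.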



## Evaluation and Comparison of BoW vs BERT

**BERT model performance**

* Training set accuracy: 99%. The model performs extremely well on the training set, with near-perfect precision, recall, and F1-score for all categories.
* Test set accuracy: 79%. The model struggles more on the test set but still maintains a solid accuracy. Class 1 is classified best (F1 0.89) and class 2 worst (F1 0.65).

**Bag-of-Words (BoW) model performance**

* Training set accuracy: 100%
* Test set accuracy: 83%

The gap between training and test accuracy indicates that the BoW model is overfitting to the training data. The BERT-based classifier shows a gap of similar size (99% versus 79%), so it also overfits this small training set.
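
To make the overfitting comparison concrete, here is a minimal sketch that computes the train-test gap for both models from the accuracies listed above. The dictionary keys are descriptive labels chosen for illustration.

```python
# Accuracies as reported in this notebook (BERT) and in Part I (BoW).
results = {
    "BERT embeddings + LR": {"train": 0.99, "test": 0.79},
    "BoW (Part I)": {"train": 1.00, "test": 0.83},
}

gaps = {}
for name, acc in results.items():
    # Gap between training and test accuracy, in percentage points.
    gaps[name] = round((acc["train"] - acc["test"]) * 100)
    print(f"{name}: train-test gap = {gaps[name]} points")
```

A smaller gap suggests less overfitting, but with a test subset of only 760 examples, differences of a few points should be read with caution.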


## Comparison:
BERT considers the entire context of a sentence, meaning it understands how words relate to each other based on their position and surrounding words. This enables it to capture more meaningful representations of the text.

BoW, on the other hand, treats each word independently and focuses purely on word frequency, ignoring word order and context. It can perform very well on the training set, especially when it overfits to common words, but it struggles with unseen or complex word relationships.
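
The point about word order can be illustrated with a toy example. The sketch below uses a simplified whitespace tokenizer, not the vectorizer from Part I, and `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True
```

Both sentences map to the same word counts even though they mean different things, which is exactly the information a contextual model can keep and a BoW representation discards.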

BERT is more general in the sense that it has been pre-trained on vast amounts of data, enabling it to recognize words in a wide variety of contexts. This gives BERT the potential to generalize better to unseen data, even though its accuracy is lower on the current test set. It can handle subtle variations in language more effectively than BoW, which only looks at word frequency. Note that the classifier here was trained on frozen embeddings from only 1,200 examples, without fine-tuning the BERT model itself.

The BoW model tends to overfit because it learns directly from the frequency of words in the training data, which may not generalize well to new data. It reaches 100% accuracy on the training set but drops to 83% on the test set.

In summary, BERT shows lower test accuracy in this specific case (79% versus 83%), but its architecture should give it an edge on more complex, unseen data because it models the meaning behind the words.
