**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles, the same task as in Part I, as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Because fine-tuning requires more computational resources, this part is optional. If you decide to complete it, you will need a GPU. (For reference, training on a 2020 MacBook Pro with 16GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest Google Colab or another cloud-based service with a GPU. You can easily upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [13]:
# imports for the project
from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/); the cells below collect the splits in a `DatasetDict`. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase `TRAIN_SIZE` and `TEST_SIZE` to load more data.

In [14]:
TRAIN_SIZE = 1 # percent as whole number
TEST_SIZE = 10 # percent as whole number

In [15]:
ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]", keep_in_memory=True )  # n% of training data
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]", keep_in_memory=True)  # n% of test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 760
    })
})
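The row counts above follow from the percentage slices. As a rough check, the sketch below reproduces them with plain arithmetic. It assumes the full AG News splits hold 120,000 training and 7,600 test rows; `FULL_SPLIT_SIZES` and `rows_for_percent` are hypothetical names used only for illustration and are not part of the `datasets` library.

```python
# Assumed full split sizes for AG News (not read from the Hub here).
FULL_SPLIT_SIZES = {"train": 120_000, "test": 7_600}


def rows_for_percent(split: str, percent: int) -> int:
    """Rows selected by a slice such as f"{split}[:{percent}%]"."""
    return FULL_SPLIT_SIZES[split] * percent // 100


train_rows = rows_for_percent("train", 1)   # TRAIN_SIZE = 1
test_rows = rows_for_percent("test", 10)    # TEST_SIZE = 10

print(train_rows)  # 1200
print(test_rows)   # 760
```

Doubling `TRAIN_SIZE` to 2 would therefore select about 2,400 training rows, which also roughly doubles the embedding time on CPU.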

In [16]:
# Load the pipeline
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use cpu
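The log line above shows that the pipeline ended up on the CPU even though `device=0` was requested. One way to make the choice explicit is to pick the device before building the pipeline. The helper below is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical name, and `-1` is the value the `pipeline` function conventionally uses for CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # optional dependency; fall back to CPU if missing
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"Using device index: {device}")
```

The result could then be passed as `device=device` when creating `embedder`.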


In [17]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

Map:   0%|          | 0/760 [00:00<?, ? examples/s]
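To make the indexing `e[0][0]` in `get_embeddings` easier to follow, here is a small sketch with a stand-in for the pipeline output. It assumes that each text yields a nested list shaped `[1][num_tokens][hidden_size]`; the values and the names `fake_output` and `first_token_embeddings` are made up for illustration.

```python
# Hypothetical stand-in for the feature-extraction output:
# one entry per text, each shaped [1][num_tokens][hidden_size].
fake_output = [
    [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],                   # text 1: 2 tokens
    [[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5], [1.6, 1.7, 1.8]]],  # text 2: 3 tokens
]


def first_token_embeddings(outputs):
    """Keep only the first token's vector ([CLS]) for each text."""
    # e[0] drops the leading batch dimension, e[0][0] selects token 0.
    return [e[0][0] for e in outputs]


cls_vectors = first_token_embeddings(fake_output)
print(cls_vectors)  # [[0.1, 0.2, 0.3], [1.0, 1.1, 1.2]]
```

Texts of different lengths thus map to vectors of the same size, which is why `X_train` can later be stacked into a single `(n_samples, hidden_size)` array.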

In [18]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})

In [19]:

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (1200, 768), y_train shape: (1200,)
X_test shape: (760, 768), y_test shape: (760,)


In [23]:
lr_clf = LogisticRegression(max_iter=1000)
# Fit the model on the training data
lr_clf.fit(X_train, y_train)

In [25]:
y_pred_train = lr_clf.predict(X_train)
print(classification_report(y_train, y_pred_train))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00       273
           1       1.00      1.00      1.00       182
           2       1.00      0.99      1.00       202
           3       1.00      1.00      1.00       543

    accuracy                           1.00      1200
   macro avg       1.00      1.00      1.00      1200
weighted avg       1.00      1.00      1.00      1200



In [26]:
y_pred = lr_clf.predict(X_test)

### Final Results:

In [35]:
# Map numeric labels to class names (also defined in the confusion matrix cell below)
label_map = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}

# Predict on training and test sets with the Logistic Regression model
y_train_pred_lr = lr_clf.predict(X_train)
y_test_pred_lr = lr_clf.predict(X_test)

# Calculate accuracy scores
train_accuracy_lr = accuracy_score(y_train, y_train_pred_lr)
test_accuracy_lr = accuracy_score(y_test, y_test_pred_lr)

# Print the classification report for the test set
print("Logistic Regression - Test Performance:")
print(classification_report(y_test, y_test_pred_lr, target_names=list(label_map.values())))


Logistic Regression - Test Performance:
              precision    recall  f1-score   support

       World       0.75      0.76      0.75       197
      Sports       0.91      0.87      0.89       199
    Business       0.66      0.59      0.62       158
    Sci/Tech       0.74      0.82      0.78       206

    accuracy                           0.77       760
   macro avg       0.76      0.76      0.76       760
weighted avg       0.77      0.77      0.77       760



### Confusion matrix

In [34]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import pandas as pd

# Define label map
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

# Predict on the test set
y_pred = lr_clf.predict(X_test)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Convert the confusion matrix to a DataFrame with descriptive headers
conf_matrix_df = pd.DataFrame(
    conf_matrix,
    index=[f" {label}" for label in label_map.values()],
    columns=[f" {label}" for label in label_map.values()]
)

# Calculate overall test accuracy
overall_accuracy = accuracy_score(y_test, y_pred)

# Calculate per-class precision and recall
per_class_precision = precision_score(y_test, y_pred, average=None)
per_class_recall = recall_score(y_test, y_pred, average=None)
# Correct predictions divided by true instances per class; this equals per-class recall
per_class_accuracy = conf_matrix.diagonal() / conf_matrix.sum(axis=1)

# Display the confusion matrix
print("Confusion Matrix:")
print(conf_matrix_df)

# Display overall accuracy
print(f"\nOverall Test Accuracy: {overall_accuracy:.2f}")

# Display per-class metrics
print("\nPer-Class Metrics:")
for label, precision, recall, accuracy in zip(label_map.values(), per_class_precision, per_class_recall, per_class_accuracy):
    print(f"{label}: Precision: {precision:.2f}, Recall: {recall:.2f}, Accuracy: {accuracy:.2f}")

Confusion Matrix:
            World   Sports   Business   Sci/Tech
 World        149       15         16         17
 Sports        16      174          5          4
 Business      25        1         93         39
 Sci/Tech       9        1         27        169

Overall Test Accuracy: 0.77

Per-Class Metrics:
World: Precision: 0.75, Recall: 0.76, Accuracy: 0.76
Sports: Precision: 0.91, Recall: 0.87, Accuracy: 0.87
Business: Precision: 0.66, Recall: 0.59, Accuracy: 0.59
Sci/Tech: Precision: 0.74, Recall: 0.82, Accuracy: 0.82
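The per-class figures can be checked by hand from the confusion matrix. The sketch below copies the matrix printed above (rows are true labels, columns are predicted labels) and recomputes precision and recall with plain Python. It also ranks the off-diagonal cells to show which confusions are most frequent. The variable names are illustrative only.

```python
labels = ["World", "Sports", "Business", "Sci/Tech"]

# Confusion matrix copied from the output above: rows = true, columns = predicted.
conf_matrix = [
    [149, 15, 16, 17],
    [16, 174, 5, 4],
    [25, 1, 93, 39],
    [9, 1, 27, 169],
]
n = len(labels)

row_totals = [sum(row) for row in conf_matrix]        # true instances per class
col_totals = [sum(col) for col in zip(*conf_matrix)]  # predictions per class

# Recall divides the diagonal by the row total, precision by the column total.
recall = [conf_matrix[i][i] / row_totals[i] for i in range(n)]
precision = [conf_matrix[i][i] / col_totals[i] for i in range(n)]

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")

# Rank misclassifications (off-diagonal cells) from most to least frequent.
confusions = sorted(
    ((conf_matrix[i][j], labels[i], labels[j])
     for i in range(n) for j in range(n) if i != j),
    reverse=True,
)
for count, true_label, predicted_label in confusions[:3]:
    print(f"{true_label} -> {predicted_label}: {count}")
```

Because the diagonal divided by the row total is exactly the definition of recall, the "Accuracy" column in the per-class output matches the recall column.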


Basic observations:
1. World:
"World" articles reach a per-class accuracy (equivalent to recall) of 76%, with 149 of 197 instances classified correctly. The errors are spread almost evenly across the other categories: 15 as "Sports," 16 as "Business," and 17 as "Sci/Tech." This spread suggests that global news often covers topics that overlap with the specialized domains, which makes the correct label harder to pinpoint.

2. Sports:
"Sports" articles are classified best, with a precision of 0.91 and a recall of 0.87, which suggests that the specialized vocabulary of sports content is distinctive. The main confusion is with the "World" category, where 16 sports articles are misclassified, possibly because international sports events also carry global significance.

3. Business:
"Business" articles are the most problematic, with the lowest recall (0.59) and the lowest precision (0.66). Of the 158 business articles, 39 are misclassified as "Sci/Tech" and 25 as "World" (only 1 as "Sports"). This suggests that the vocabulary in business news overlaps with technological content and global issues, leading to considerable misclassification.

4. Sci/Tech:
"Sci/Tech" articles have the second-highest recall (0.82) but a lower precision (0.74). The lower precision largely reflects the 39 "Business" articles that are labeled "Sci/Tech." In the other direction, 27 Sci/Tech articles are misclassified as "Business" and 9 as "World." This pattern indicates that technological content, particularly when it has business implications, is hard for the model to separate from pure business news or broader global topics.