**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [23]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from sklearn.preprocessing import RobustScaler


### 1. Load the data

In [24]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [55]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

answer_to_life = 42

def preprocess(df: pd.DataFrame, frac : float = 1, label_map : dict[int, str] = label_map, seed : int = answer_to_life) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

# Use bigger fraction and remove fraction usage
train_df = preprocess(train)
test_df = preprocess(test)

train_df.shape, test_df.shape

((120000, 2), (7600, 2))

![Imgur Image](https://i.imgur.com/FqFtqpW.png)

In [56]:
(
    X_train,
    X_val,
    y_train,
    y_val
) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)


# analyze bigrams and unigrams
cv = CountVectorizer(ngram_range=(1,2))
X_train_vectorized = cv.fit_transform(X_train)


In [57]:


lr_clf = LogisticRegression(C=0.1, penalty='l2',solver='saga', max_iter=1000)
# Use saga solver and increase iterations to converge regression
# add penalty for generalizability

lr_clf.fit(X_train_vectorized, y_train)


In [50]:
X_val_vectorized = cv.transform(X_val) 

y_pred = lr_clf.predict(X_val_vectorized)

test_df_vectorized = cv.transform(test_df["text"])


In [51]:

print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized), target_names=label_map.values()))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred, target_names=label_map.values()))

print("Performance on the test set:")
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized), target_names=label_map.values()))

Performance on the training set:
              precision    recall  f1-score   support

       World       0.99      0.99      0.99      9611
      Sports       0.99      1.00      0.99      9625
    Business       0.99      1.00      1.00      9559
    Sci/Tech       1.00      0.99      0.99      9605

    accuracy                           0.99     38400
   macro avg       0.99      0.99      0.99     38400
weighted avg       0.99      0.99      0.99     38400

Performance on the validation set:
              precision    recall  f1-score   support

       World       0.88      0.86      0.87      2389
      Sports       0.86      0.88      0.87      2375
    Business       0.94      0.97      0.96      2441
    Sci/Tech       0.92      0.89      0.90      2395

    accuracy                           0.90      9600
   macro avg       0.90      0.90      0.90      9600
weighted avg       0.90      0.90      0.90      9600

Performance on the test set:
              precision    recall