**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

See the assignment description in the README file (section 1). <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Report your results using a classification report.

* Include some reflections on your results, for example: are any hyperparameters particularly important?

* Follow the steps given in the `bow_guide` notebook.
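
As a minimal sketch of the reporting step, assuming string class labels like the ones used later in this notebook, `classification_report` can return a dictionary (`output_dict=True`), which makes it easier to pull single numbers into the reflections. The toy labels below are invented for illustration. With string labels, the per-class rows come out in sorted (alphabetical) label order.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration only (not taken from AG News).
y_true = ["World", "Sports", "Business", "Sci/Tech", "Sports", "World"]
y_pred = ["World", "Sports", "Business", "Business", "Sports", "Sports"]

# output_dict=True returns nested dicts instead of a formatted string.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Per-class rows are the keys that are not summary rows.
summary_rows = {"accuracy", "macro avg", "weighted avg"}
class_rows = [name for name in report if name not in summary_rows]

print(class_rows)
print(round(report["accuracy"], 3), round(report["macro avg"]["f1-score"], 3))
```

The order of `class_rows` matters whenever display names are supplied separately, because `target_names` is matched to the sorted labels by position.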

<br>

***

In [58]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from sklearn.preprocessing import RobustScaler


### 1. Load the data

In [59]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [None]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

answer_to_life = 42

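def preprocess(
    df: pd.DataFrame,
    frac: float = 1,
    label_map: dict[int, str] = label_map,
    seed: int = answer_to_life,
) -> pd.DataFrame:
    """Map integer labels to names and sample `frac` of the rows from each class."""
    return (
        df
        # replace integer class ids with readable names
        .assign(label=lambda x: x['label'].map(label_map))
        # keep only rows whose label is in label_map (unmapped ids become NaN)
        [lambda d: d['label'].isin(label_map.values())]
        # sample the same fraction per class so the class balance is preserved
        .groupby('label')[["text", "label"]]
        .apply(lambda g: g.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )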

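# frac defaults to 1, which keeps every row. The shapes recorded below (12000 and 760
# rows) equal 10% of the full splits, so they appear to come from a run with frac=0.1.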
train_df = preprocess(train)
test_df = preprocess(test)

train_df.shape, test_df.shape

((12000, 2), (760, 2))
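
The per-class sampling in `preprocess` keeps the label distribution intact when `frac` is below 1. A minimal sketch on an invented toy frame, using `DataFrameGroupBy.sample` as a shorter alternative to the `groupby(...).apply(...)` pattern above (the frame and the `frac=0.2` value are assumptions for illustration):

```python
import pandas as pd

# Toy frame, invented for illustration: 3 classes with 10 rows each.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(30)],
    "label": ["World"] * 10 + ["Sports"] * 10 + ["Business"] * 10,
})

# Draw the same fraction from every class so the class balance is kept.
sampled = (
    toy
    .groupby("label")
    .sample(frac=0.2, random_state=42)
    .reset_index(drop=True)
)

per_class = sampled["label"].value_counts().to_dict()
print(sampled.shape, per_class)
```

Each class contributes 2 of its 10 rows, so the sample stays balanced.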

![Imgur Image](https://i.imgur.com/FqFtqpW.png)

In [61]:
(
    X_train,
    X_val,
    y_train,
    y_val
) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)


# count unigrams and bigrams; the vocabulary is fit on the training split only
cv = CountVectorizer(ngram_range=(1,2))
X_train_vectorized = cv.fit_transform(X_train)


In [None]:
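# NOTE: this grid crosses every penalty with every solver, but many pairs are
# unsupported (e.g. 'l1' with 'lbfgs', 'elasticnet' without l1_ratio), and recent
# scikit-learn versions expect None instead of the string 'none'.
# Those candidates fail and are scored as nan (see the warning below).
param_grid = [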
    {
        'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
        'C': [0.01, 0.1, 1],
        'solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga']
    }
]

lr_clf = LogisticRegression(max_iter=1000)

gs_clf = GridSearchCV(lr_clf, param_grid=param_grid, cv=3)

best_clf = gs_clf.fit(X_train_vectorized, y_train)

117 fits failed out of a total of 180.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "/home/pz/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/pz/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pz/anaconda3/envs/aiml25-ma2/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty, se
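
The failure count is consistent with the grid itself. It crosses 4 penalties, 3 values of `C`, and 5 solvers (60 candidates, 180 fits with `cv=3`), but not every solver supports every penalty. Per the scikit-learn documentation, `lbfgs`, `newton-cg`, and `sag` support only `l2` or no penalty, `liblinear` supports `l1` and `l2`, and `saga` supports all of them, with `elasticnet` additionally requiring `l1_ratio`. Recent scikit-learn releases also expect `None` rather than the string `'none'`. That leaves 7 valid penalty/solver pairs in the original grid, i.e. 7 × 3 × 3 = 63 successful fits, which matches 180 − 117. A minimal sketch of a grid restricted to those pairs, run on invented toy data rather than the AG News matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Only penalty/solver pairs that LogisticRegression documents as supported.
valid_grid = [
    {"penalty": ["l2"], "C": [0.01, 0.1, 1],
     "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"]},
    {"penalty": ["l1"], "C": [0.01, 0.1, 1],
     "solver": ["liblinear", "saga"]},
]
n_candidates = len(ParameterGrid(valid_grid))  # 15 + 6 candidates

# Toy binary data, invented for illustration (not the AG News features).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# error_score="raise" surfaces any invalid combination immediately.
search = GridSearchCV(LogisticRegression(max_iter=1000), valid_grid, cv=3,
                      error_score="raise")
search.fit(X_toy, y_toy)
print(n_candidates, search.best_params_)
```

Splitting the grid into one dictionary per penalty keeps each solver list limited to the solvers that accept that penalty, so no fits are wasted on combinations that can only fail.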

In [None]:
best_clf.best_estimator_

In [66]:
# liblinear solver with the default L2 penalty (C=1) for regularization;
# max_iter is raised so the solver has room to converge
lr_clf = LogisticRegression(C=1, solver='liblinear', max_iter=1000)

lr_clf.fit(X_train_vectorized, y_train)

In [67]:
X_val_vectorized = cv.transform(X_val) 

y_pred = lr_clf.predict(X_val_vectorized)

test_df_vectorized = cv.transform(test_df["text"])


In [68]:

# Labels are already strings, so no target_names are passed: classification_report
# sorts the labels (Business, Sci/Tech, Sports, World), and target_names would only
# rename those rows by position.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

print("Performance on the test set:")
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized)))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00      2388
    Sci/Tech       1.00      1.00      1.00      2376
      Sports       1.00      1.00      1.00      2432
       World       1.00      1.00      1.00      2404

    accuracy                           1.00      9600
   macro avg       1.00      1.00      1.00      9600
weighted avg       1.00      1.00      1.00      9600

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.86      0.81      0.83       612
    Sci/Tech       0.83      0.86      0.85       624
      Sports       0.92      0.96      0.94       568
       World       0.89      0.86      0.88       596

    accuracy                           0.87      2400
   macro avg       0.87      0.88      0.87      2400
weighted avg       0.87      0.87      0.87      2400

Performance on the test set:
              precision    recall