**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook.

<br>

***

In [15]:
# imports for the project

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from collections import Counter


### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` argument of `preprocess` to sample more data.

In [3]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [4]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
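
The per-label sampling inside `preprocess` can be checked on a small example. The following is a minimal sketch on a made-up DataFrame (`toy` and its contents are illustrative, not part of AG News). It uses `DataFrameGroupBy.sample`, a pandas shortcut that should give the same kind of stratified sample as the `groupby(...).apply(...)` chain above.

```python
import pandas as pd

# Hypothetical toy data: 10 documents per label
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "label": ["World"] * 10 + ["Sports"] * 10,
})

# Sample 50% of the rows within each label, so class proportions are preserved
sampled = (
    toy
    .groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

label_counts = sampled["label"].value_counts().to_dict()
print(label_counts)
```

Each label keeps the same share of its rows, which is why the sampled training set stays balanced across the four categories.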

### 2. Split the data

In [5]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(960,) (240,) (960,) (240,)
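
The split above is a plain random split, so the class counts in the two parts can differ slightly. If exactly proportional classes are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents for each of the four labels
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 10,
})

# stratify keeps the label proportions identical in both parts of the split
X_tr, X_va, y_tr, y_va = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    random_state=42,
    stratify=toy["label"],
)

val_counts = y_va.value_counts().to_dict()
print(val_counts)
```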


### 3. Build the BoW Model

In [6]:
# countvectorizer
cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(X_train)

In [21]:
# Inspect the learned vocabulary: a dict mapping each token to its column index
cv.vocabulary_

{'movie': 4425,
 'studios': 6556,
 'sue': 6587,
 'us': 7185,
 'european': 2404,
 'file': 2620,
 'sharing': 6102,
 '39': 114,
 'parasites': 4869,
 'los': 4012,
 'angeles': 494,
 'have': 3112,
 'expanded': 2451,
 'their': 6836,
 'fight': 2608,
 'against': 344,
 'illegal': 3334,
 'downloading': 2150,
 'suing': 6598,
 'more': 4399,
 'than': 6827,
 '100': 9,
 'and': 485,
 'based': 815,
 'computer': 1585,
 'services': 6063,
 'that': 6831,
 'transmit': 6998,
 'files': 2622,
 'across': 271,
 'internet': 3502,
 'networks': 4541,
 'dvr': 2204,
 'to': 6921,
 'the': 6833,
 'rescue': 5663,
 'news': 4553,
 'reachs': 5477,
 'from': 2811,
 'america': 455,
 'of': 4659,
 'another': 520,
 'strange': 6518,
 'tale': 6722,
 'dvd': 2202,
 'hardwire': 3101,
 'going': 2951,
 'haywire': 3115,
 'couple': 1719,
 'weeks': 7422,
 'ago': 357,
 'man': 4095,
 'in': 3373,
 'oregon': 4743,
 'was': 7382,
 'paid': 4836,
 'an': 475,
 'unexpected': 7121,
 'visit': 7293,
 'air': 378,
 'force': 2736,
 'after': 338,
 'his': 31
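
To make the token-to-column mapping concrete, here is a minimal sketch on two made-up sentences (`docs` is illustrative, not taken from the dataset). With default settings, `CountVectorizer` lowercases the text, keeps tokens of two or more alphanumeric characters, and assigns column indices in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["the cat sat", "the cat sat on the mat"]

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(docs)

# vocabulary_ maps each token to the column that stores its count
vocab = {token: int(index) for token, index in toy_cv.vocabulary_.items()}
print(vocab)

# Each row is a document, each column a token count
dense = toy_matrix.toarray()
print(dense)
```

The second row has a 2 in the column for `the`, since that word occurs twice in the second document.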

Test: Print the most common words

In [22]:
# Sum the occurrences of each word across all documents
word_counts = np.asarray(X_train_vectorized.sum(axis=0)).flatten()

# Create a DataFrame with words and their corresponding counts
word_freq = pd.DataFrame({'word': cv.get_feature_names_out(), 'count': word_counts})

# Sort the DataFrame by count in descending order and display the top 10 words
top_10_words = word_freq.sort_values(by='count', ascending=False).head(10)
print(top_10_words)

      word  count
6833   the   1688
6921    to    942
4659    of    796
3373    in    758
485    and    567
4695    on    432
2734   for    383
114     39    350
623     as    225
6831  that    221


In [31]:
word = "the"
feature_names = cv.get_feature_names_out().tolist()

# Check that the word is in the vocabulary before looking it up;
# list.index raises a ValueError for a missing word
if word in feature_names:
    print(f"The word {word} is in the vocabulary.")

    # Get the word's column index, then read its total count from word_counts
    the_index = feature_names.index(word)
    the_count = word_counts[the_index]
    print(f"The word {word} occurred {the_count} times in the training set.")
else:
    print(f"The word {word} is not in the vocabulary.")

The word the is in the vocabulary.
The word the occurred 1688 times in the training set.


In [16]:
# Count how often each full text occurs in X_train (a check for exact duplicates)
sentence_counts = Counter(X_train)

# Get the top 10 most common sentences
top_10_sentences = sentence_counts.most_common(10)

# Print the top 10 sentences in a table
print(pd.DataFrame(top_10_sentences, columns=["sentence", "count"]))

                                            sentence  count
0  Movie studios sue US, European file-sharing  #...      1
1  DVR To The Rescue! News reachs us from America...      1
2  services will make cash flow, says Vodafone Vo...      1
3  Phelps, U.S. Win Men's 4x200 Freestyle Relay  ...      1
4  Applied Materials Signals Caution Applied Mate...      1
5  Verizon Wireless to Buy NextWave Licenses (AP)...      1
6  Bailey welcomes Azeri to the Breeders #39; Cup...      1
7  US blames Islamic charities for funding Iraq a...      1
8  Oracle victorious in quest for PeopleSoft It #...      1
9  Target getting bum rap for ending  #39;ringer ...      1


### 4. Create classifier

In [27]:
lr_clf = LogisticRegression() # Note that we can set hyperparameters here

lr_clf.fit(X_train_vectorized, y_train)
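
One way to explore which hyperparameters matter is a small grid search over both the vectorizer and the classifier. The following is a minimal sketch on a made-up corpus. The texts, the grid values, and the choice of `f1_macro` are assumptions for illustration, not tuned recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes
texts = [
    "team wins the final match", "striker scores twice in cup game",
    "coach praises team after match", "player injured before the game",
    "shares fall as profits drop", "bank reports record profits",
    "market rallies on earnings", "company shares rise after merger",
]
labels = ["Sports"] * 4 + ["Business"] * 4

# Putting both steps in a pipeline refits the vectorizer inside each CV fold
pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefixes route each parameter to the right pipeline step
param_grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

best_params = search.best_params_
print(best_params)
```

On the real data the same pattern would apply with `X_train` and `y_train` in place of the toy lists. `C` controls the regularization strength of `LogisticRegression` (smaller values mean stronger regularization).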

### 5. Get predictions and evaluate the model

In [28]:
X_val_vectorized = cv.transform(X_val) # note that we use transform here, not fit_transform

y_pred = lr_clf.predict(X_val_vectorized)

In [29]:
# The labels are strings, so the report rows follow sorted label order.
# target_names only renames rows by position and is not needed here.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.76      0.68      0.72        62
    Sci/Tech       0.69      0.60      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.78      0.90      0.83        58

    accuracy                           0.76       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.76      0.75       240
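
A note on row order in these reports: when the labels are strings and no `labels` argument is given, `classification_report` lists the classes in sorted order, and `target_names` only renames those rows by position. A minimal sketch on made-up predictions (`y_true_toy` and `y_pred_toy` are illustrative) showing the default order and how `labels` sets a custom one:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels
y_true_toy = ["World", "Sports", "Business", "Sci/Tech", "World", "Sports"]
y_pred_toy = ["World", "Sports", "Business", "World", "World", "Business"]

summary_rows = ("accuracy", "macro avg", "weighted avg")

# Default: rows follow the sorted label values
default_report = classification_report(
    y_true_toy, y_pred_toy, output_dict=True, zero_division=0
)
default_order = [key for key in default_report if key not in summary_rows]
print(default_order)

# Passing labels fixes the row order explicitly
custom_labels = ["World", "Sports", "Business", "Sci/Tech"]
custom_report = classification_report(
    y_true_toy, y_pred_toy, labels=custom_labels, output_dict=True, zero_division=0
)
custom_order = [key for key in custom_report if key not in summary_rows]
print(custom_order)
```

Because the metrics are keyed by the actual label in both cases, the numbers for each class stay the same and only the order of the rows changes.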



In [30]:
test_df_vectorized = cv.transform(test_df["text"])

# As above, the report rows follow sorted label order, so target_names is omitted
print("Performance on the test set:")
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.74      0.72      0.73       190
    Sci/Tech       0.75      0.72      0.73       190
      Sports       0.83      0.88      0.86       190
       World       0.79      0.79      0.79       190

    accuracy                           0.78       760
   macro avg       0.78      0.78      0.78       760
weighted avg       0.78      0.78      0.78       760

