# CMT316 Coursework 1 Part 2

The goal of this project is to train a machine learning classifier on a set of
English news articles and to measure its performance by accuracy, precision,
recall, and F1 score.

This notebook should be executed in order from top to bottom.

First, the required libraries are imported and spaCy's English pipeline
(`en_core_web_sm`) is loaded into memory.


In [1]:
from pathlib import Path

import pandas as pd
import spacy
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_selection import SelectKBest, chi2

nlp = spacy.load("en_core_web_sm")

The dataset directory is then recursively scanned for all text files. This
assumes the dataset is located in a local directory named `datasets/bbc/`. The
category of each file is taken from the name of its parent directory.

Those text files are read into Python and fed into the spaCy pipeline loaded in
the previous step to obtain a list of tokens in the document. That pipeline
performs NLP processing including tokenization, lemmatization, and POS tagging
on the document.

The list of tokens is then iterated over to count all the unigrams, bigrams,
and word-POS pairs in the document.

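Note that stopwords are excluded from the unigram counts only, bigrams are not
formed across whitespace tokens such as paragraph breaks, and punctuation tokens
are skipped entirely (so a bigram can still join the words on either side of a
punctuation mark).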


In [2]:
all_docs = []
for path in Path("./datasets/bbc/").glob("**/*.txt"):
    doc = nlp(path.read_text(errors="ignore"))
    category = path.parts[-2]
    doc_unigrams = {}
    doc_bigrams = {}
    doc_wordpos = {}
    prev_lemma = None
    for token in doc:
        if token.pos_ == "SPACE":
            prev_lemma = None  # don't connect bigrams past paragraph breaks
            continue
        if token.pos_ == "PUNCT":
            continue
        lemma = token.lemma_.lower()

        # Feature 1: unigram
        if not token.is_stop:
            if lemma not in doc_unigrams:
                doc_unigrams[lemma] = 1
            else:
                doc_unigrams[lemma] += 1

        # Feature 2: bigram
        if prev_lemma is not None:
            bigram = f"{prev_lemma} {lemma}"
            if bigram not in doc_bigrams:
                doc_bigrams[bigram] = 1
            else:
                doc_bigrams[bigram] += 1

        # Feature 3: word+POS
        word_pos = f"{lemma}_{token.pos_}"
        if word_pos not in doc_wordpos:
            doc_wordpos[word_pos] = 1
        else:
            doc_wordpos[word_pos] += 1

        prev_lemma = lemma

    all_docs.append((doc_unigrams, doc_bigrams, doc_wordpos, category))
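
The counting logic above can also be written more compactly with
`collections.Counter`. The following is a minimal sketch rather than the code
used in this notebook: `count_features` is a hypothetical helper, and the
`(lemma, pos, is_stop)` tuples are hand-made stand-ins for spaCy tokens so that
the example runs without loading a pipeline.

```python
from collections import Counter


def count_features(tokens):
    """Count unigrams, bigrams, and word+POS pairs.

    `tokens` is an iterable of (lemma, pos, is_stop) tuples, a hypothetical
    stand-in for the attributes read from spaCy tokens in the cell above.
    """
    unigrams, bigrams, wordpos = Counter(), Counter(), Counter()
    prev_lemma = None
    for lemma, pos, is_stop in tokens:
        if pos == "SPACE":
            prev_lemma = None  # don't connect bigrams past paragraph breaks
            continue
        if pos == "PUNCT":
            continue  # punctuation is skipped, prev_lemma is kept
        lemma = lemma.lower()
        if not is_stop:
            unigrams[lemma] += 1
        if prev_lemma is not None:
            bigrams[f"{prev_lemma} {lemma}"] += 1
        wordpos[f"{lemma}_{pos}"] += 1
        prev_lemma = lemma
    return unigrams, bigrams, wordpos


# Hand-made tokens for "The market rise.\n\nThe market" (illustrative only)
tokens = [
    ("the", "DET", True),
    ("market", "NOUN", False),
    ("rise", "VERB", False),
    (".", "PUNCT", False),
    ("\n\n", "SPACE", False),
    ("the", "DET", True),
    ("market", "NOUN", False),
]
unigrams, bigrams, wordpos = count_features(tokens)
print(unigrams)
print(bigrams)
print(wordpos)
```

A `Counter` returns 0 for missing keys, which removes the need for the
`if ... not in ...` checks.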

Then, the list of processed documents is split into training, development, and
test sets with a ratio of approximately 70:15:15.


In [3]:
print(f"Whole dataset: {len(all_docs)} files")

# train_test_split() shuffles the list by default
train_and_devel_docs, test_docs = train_test_split(all_docs, test_size=0.15)
train_docs, development_docs = train_test_split(train_and_devel_docs, test_size=0.18)

print(f"Train dataset: {len(train_docs)} files")
print(f"Development dataset: {len(development_docs)} files")
print(f"Test dataset: {len(test_docs)} files")

Whole dataset: 2225 files
Train dataset: 1550 files
Development dataset: 341 files
Test dataset: 334 files
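
As a rough check on the 70:15:15 ratio, the arithmetic behind the two
`test_size` values can be sketched as follows. It assumes that the held-out
portion of each split is rounded up to a whole number of documents, which
reproduces the counts printed above; `split_sizes` is a hypothetical helper and
not part of scikit-learn.

```python
import math


def split_sizes(total, test_size, dev_size):
    """Return (train, dev, test) counts for two successive splits.

    Assumption: the held-out part of each split is rounded up.
    """
    test_n = math.ceil(test_size * total)
    remaining = total - test_n
    dev_n = math.ceil(dev_size * remaining)
    train_n = remaining - dev_n
    return train_n, dev_n, test_n


train_n, dev_n, test_n = split_sizes(2225, test_size=0.15, dev_size=0.18)
print(train_n, dev_n, test_n)

# The second split acts on the remaining 85% of the data, so 0.18 of it is
# about 15% of the whole dataset.
print(round(0.18 * 0.85, 3))
```

Because no `random_state` is passed to `train_test_split`, the documents that
end up in each set differ between runs; passing a fixed `random_state` would
make the split repeatable.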


A global vocabulary is built from the training set by taking the 1000 most
frequent entries for each of the three features.


In [4]:
unigram_frequencies = {}
bigram_frequencies = {}
wordpos_frequencies = {}

for doc_unigrams, doc_bigrams, doc_wordpos, _ in train_docs:
    for token, frequency in doc_unigrams.items():
        if token not in unigram_frequencies:
            unigram_frequencies[token] = frequency
        else:
            unigram_frequencies[token] += frequency

    for token, frequency in doc_bigrams.items():
        if token not in bigram_frequencies:
            bigram_frequencies[token] = frequency
        else:
            bigram_frequencies[token] += frequency

    for token, frequency in doc_wordpos.items():
        if token not in wordpos_frequencies:
            wordpos_frequencies[token] = frequency
        else:
            wordpos_frequencies[token] += frequency

top_unigram_freqs = sorted(
    unigram_frequencies.items(), key=lambda x: x[1], reverse=True
)[:1000]
top_bigram_freqs = sorted(
    bigram_frequencies.items(), key=lambda x: x[1], reverse=True
)[:1000]
top_wordpos_freqs = sorted(
    wordpos_frequencies.items(), key=lambda x: x[1], reverse=True
)[:1000]

vocabulary = [w[0] for w in top_unigram_freqs + top_bigram_freqs + top_wordpos_freqs]
print(len(vocabulary))
print(vocabulary[:10])

3000
['say', 'year', 'mr', '-', 'people', '%', 'new', 'time', 'good', 'm']
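
The same "sum the counts, then keep the most frequent entries" step can be
sketched with `collections.Counter`. This is only an illustration on made-up
counts; the variable names `docs`, `totals`, and `top_tokens` are not part of
the notebook.

```python
from collections import Counter

# Made-up per-document unigram counts (illustrative only)
docs = [
    {"say": 3, "year": 1},
    {"say": 2, "market": 4},
    {"year": 1},
]

totals = Counter()
for doc_counts in docs:
    totals.update(doc_counts)  # adds the counts instead of overwriting them

# Keep the 2 most frequent tokens (the notebook keeps 1000 per feature)
top_tokens = [token for token, _ in totals.most_common(2)]
print(top_tokens)
```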


The documents are then vectorized by building a dataframe from the
per-document counts. Passing `columns=vocabulary` keeps only the features that
are in the global vocabulary and drops all others; features that do not occur
in a document are filled with 0.

The dataframe contains 3000 columns: 1000 for each of the 3 feature types.


In [5]:
X_train = pd.DataFrame(
    [unigrams | bigrams | wordpos for unigrams, bigrams, wordpos, _ in train_docs],
    columns=vocabulary,
).fillna(0)
X_train

Unnamed: 0,say,year,mr,-,people,%,new,time,good,m,...,majority_NOUN,handset_NOUN,defence_NOUN,history_NOUN,defend_VERB,america_PROPN,economist_NOUN,open_PROPN,explain_VERB,mike_PROPN
0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,1.0,0.0,3.0,1.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13.0,2.0,6.0,0.0,2.0,0.0,10.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1545,4.0,1.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1546,1.0,0.0,13.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1547,6.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1548,4.0,0.0,0.0,0.0,1.0,1.0,2.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
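
How the `columns` argument filters the features can be seen on a small example.
This is a minimal sketch with a made-up three-entry vocabulary and two made-up
documents; it is not data from this project.

```python
import pandas as pd

# Made-up vocabulary and per-document counts (illustrative only)
toy_vocabulary = ["say", "market", "say_VERB"]
toy_docs = [
    {"say": 2, "say_VERB": 2, "rare": 1},  # "rare" is not in the vocabulary
    {"market": 1},
]

X_toy = pd.DataFrame(toy_docs, columns=toy_vocabulary).fillna(0)
print(X_toy)
```

The out-of-vocabulary key `rare` does not become a column, and counts missing
from a document become 0 after `fillna(0)`. The `|` operator used in the cell
above to merge the three dictionaries requires Python 3.9 or later.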


In [6]:
Y_train = pd.Series([cat for _, _, _, cat in train_docs])
Y_train.value_counts()

business         361
sport            358
politics         283
tech             282
entertainment    266
Name: count, dtype: int64

Feature selection is performed to reduce the dimensionality of the training
matrix by keeping the 1000 features with the highest chi-squared scores, as
computed by the `sklearn.feature_selection.chi2` function.


In [7]:
selector = SelectKBest(chi2, k=1000)
selector.fit(X_train, Y_train)
X_train_selected = selector.transform(X_train)
print(f"Old train matrix: {X_train.shape}")
print(f"New train matrix: {X_train_selected.shape}")

Old train matrix: (1550, 3000)
New train matrix: (1550, 1000)
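
To illustrate what the selector keeps, the sketch below applies `SelectKBest`
with `chi2` to a tiny made-up count matrix. The data and the names `X_toy`,
`y_toy`, and `toy_selector` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up counts: column 0 occurs only in class "a", column 1 only in
# class "b", and column 2 is identical in every document.
X_toy = np.array(
    [
        [5, 0, 1],
        [4, 0, 1],
        [0, 6, 1],
        [0, 5, 1],
    ]
)
y_toy = ["a", "a", "b", "b"]

toy_selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)
mask = toy_selector.get_support()  # boolean mask of the kept columns
print(mask)
print(toy_selector.transform(X_toy).shape)
```

The constant column carries no information about the class and is the one that
is dropped. In the notebook, `selector.get_support()` could be combined with
`vocabulary` in the same way to list which features were kept.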


The machine learning algorithm used in this project is the **Linear Support
Vector Classifier** (`LinearSVC`). The hyperparameter tuned here is `C`, the
regularization parameter.

To select the best value of `C` for this project, a development set is used,
which is built below.


In [8]:
X_dev = pd.DataFrame(
    [
        unigrams | bigrams | wordpos
        for unigrams, bigrams, wordpos, _ in development_docs
    ],
    columns=vocabulary,
).fillna(0)
Y_dev_gold = pd.Series([cat for _, _, _, cat in development_docs])
Y_dev_gold.value_counts()

business         84
sport            77
tech             67
entertainment    58
politics         55
Name: count, dtype: int64

A classifier is trained for each candidate value of `C`, and the value that
yields the highest accuracy on the development set is kept.


In [9]:
c_candidates = [0.1, 1, 10, 100, 1000]
best_accuracy = 0.0
for c in c_candidates:
    svc_candidate = LinearSVC(dual="auto", C=c)
    svc_candidate.fit(X_train_selected, Y_train)
    accuracy = svc_candidate.score(selector.transform(X_dev), Y_dev_gold)
    if accuracy >= best_accuracy:
        best_accuracy = accuracy
        best_c_value = c

print(f"Best C value found: {best_c_value}")
print(f"Best accuracy: {best_accuracy}")

Best C value found: 0.1
Best accuracy: 0.9706744868035191
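
One detail of the loop above is the `>=` comparison: if two candidates reach
the same accuracy, the later one replaces the earlier one. The sketch below
shows the difference between `>=` and `>` on hypothetical scores; `pick_best`
and the accuracy values are made up for illustration and are not results from
this project.

```python
def pick_best(scores, prefer_later_on_tie=True):
    """Return the (C, accuracy) pair with the highest accuracy.

    `scores` is a list of (C, accuracy) pairs in the order they were tried.
    """
    best_c, best_acc = None, float("-inf")
    for c, acc in scores:
        if prefer_later_on_tie:
            better = acc >= best_acc  # same comparison as in the cell above
        else:
            better = acc > best_acc
        if better:
            best_c, best_acc = c, acc
    return best_c, best_acc


# Hypothetical development-set accuracies with a tie between the first two
toy_scores = [(0.1, 0.97), (1, 0.97), (10, 0.95)]
later = pick_best(toy_scores)
earlier = pick_best(toy_scores, prefer_later_on_tie=False)
print(later, earlier)
```

With a strict `>`, ties would instead be resolved in favour of the smaller `C`,
that is, the more strongly regularized model.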


Finally, the Linear SVC is trained on the training data using the best `C` value found.


In [10]:
svc = LinearSVC(dual="auto", C=best_c_value)
svc.fit(X_train_selected, Y_train)

The test set is assembled to evaluate the final model's performance.


In [11]:
X_test = pd.DataFrame(
    [unigrams | bigrams | wordpos for unigrams, bigrams, wordpos, _ in test_docs],
    columns=vocabulary,
).fillna(0)
Y_test_gold = pd.Series([cat for _, _, _, cat in test_docs])
Y_test_gold.value_counts()

politics         79
sport            76
business         65
entertainment    62
tech             52
Name: count, dtype: int64

A classification report is printed that shows the accuracy, precision,
recall, and F1 scores of the final model.


In [12]:
Y_test_prediction = svc.predict(selector.transform(X_test))
print(classification_report(Y_test_gold, Y_test_prediction))

               precision    recall  f1-score   support

     business       0.90      0.97      0.93        65
entertainment       0.96      0.89      0.92        62
     politics       0.96      0.94      0.95        79
        sport       0.96      1.00      0.98        76
         tech       0.92      0.90      0.91        52

     accuracy                           0.94       334
    macro avg       0.94      0.94      0.94       334
 weighted avg       0.94      0.94      0.94       334



In [13]:
print(f"Accuracy score: {accuracy_score(Y_test_gold, Y_test_prediction)}")
print(f"Macro-averaged precision score: {precision_score(Y_test_gold, Y_test_prediction, average='macro')}")
print(f"Macro-averaged recall score: {recall_score(Y_test_gold, Y_test_prediction, average='macro')}")
print(f"Macro-averaged F1 score: {f1_score(Y_test_gold, Y_test_prediction, average='macro')}")

Accuracy score: 0.9431137724550899
Macro-averaged precision score: 0.9419090371294784
Macro-averaged recall score: 0.939376511605993
Macro-averaged F1 score: 0.9399375100928129
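
A macro average is the unweighted mean of the per-class scores, so every
category counts equally regardless of how many documents it has. The sketch
below computes it by hand on made-up labels; `per_class_prf` is a hypothetical
helper and the labels are not predictions from this project.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one class, from paired label lists."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold labels and predictions (illustrative only)
gold = ["sport", "sport", "tech", "tech"]
pred = ["sport", "tech", "tech", "tech"]

labels = sorted(set(gold))
scores = [per_class_prf(gold, pred, label) for label in labels]

# Macro average: plain mean over the classes
macro_precision = sum(s[0] for s in scores) / len(labels)
macro_recall = sum(s[1] for s in scores) / len(labels)
macro_f1 = sum(s[2] for s in scores) / len(labels)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(macro_precision, macro_recall, macro_f1, accuracy)
```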
