# Exercise SML -- Morning day 4

### Option 1: practice with the IMDB data

* Reproduce examples from the book for SML on the IMDB data (11.2, 11.3, 11.4) (check [codefrombook.py](codefrombook.py) if you do not want to type the code)
* Play around with different options! Can you tweak the models and make them even better? Take a look back at the slides of day1-afternoon and day2-morning where we compared different vectorizers as well!

### Option 2: practice with the data from [Vermeer](/fall-2025/fall-2025_exercises/day4/exercises/data-vermeer/)

* Work with the files `train.csv` and `test.csv` in the folder [`data-vermeer`](/fall-2025/fall-2025_exercises/day4/exercises/data-vermeer/) and train a classifier using this data.
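
As a starting point for Option 2, the following is a minimal sketch of a vectorizer and a classifier chained with scikit-learn's `make_pipeline`. The `toy_*` texts and labels are made up for illustration and only stand in for the contents of `train.csv` and `test.csv`; the choice of `TfidfVectorizer` with `LogisticRegression` is an assumption, not the configuration the exercise prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the texts and labels in train.csv / test.csv
toy_train_texts = [
    "shares fell as the bank reported lower profit",
    "the company announced record profit and new shares",
    "the minister defended the new law in parliament",
    "parliament will vote on the election law",
]
toy_train_labels = ["business", "business", "politics", "politics"]
toy_test_texts = [
    "the bank expects higher profit",
    "the minister lost the vote in parliament",
]

# The pipeline fits the vectorizer on the training texts only and
# reuses that vocabulary when transforming the test texts.
toy_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="liblinear"))
toy_pipe.fit(toy_train_texts, toy_train_labels)

toy_predictions = toy_pipe.predict(toy_test_texts)
print(list(toy_predictions))
```

With the real data, the two lists of training texts and labels would be replaced by the columns read from `train.csv`, and the predictions compared against the labels in `test.csv`.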

### BONUS: Try it with your own data!

### Classifying news categories with data-vermeer

In [1]:
import csv

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC

csv.field_size_limit(2**31-1)

131072

In [4]:


# clone the repo into Colab's working directory
!git clone https://github.com/annekroon/gesis-machine-learning.git

# move into the correct exercises folder
%cd gesis-machine-learning/fall-2025/fall-2025_exercises/day4/


Cloning into 'gesis-machine-learning'...
remote: Enumerating objects: 1774, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 1774 (delta 18), reused 15 (delta 15), pack-reused 1733 (from 2)
Receiving objects: 100% (1774/1774), 249.36 MiB | 21.31 MiB/s, done.
Resolving deltas: 100% (928/928), done.
Updating files: 100% (660/660), done.
/content/gesis-machine-learning/fall-2025/fall-2025_exercises/day4


In [9]:
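def get_data(split='test'):
    """Read data-vermeer/<split>.csv and return (texts, labels)."""
    texts = []
    labels = []

    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(f'data-vermeer/{split}.csv', newline='') as fi:
        reader = csv.reader(fi, delimiter=',')
        next(reader)  # skip the header row

        for row in reader:
            texts.append(row[0])   # first column: text
            labels.append(row[1])  # second column: label

    return texts, labels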

In [10]:
# load data with existing train/test split
X_test, y_test = get_data('test')
X_train, y_train = get_data('train')

In [11]:
print(len(X_test), len(y_test))
print(len(X_train), len(y_train))

698 698
2788 2788


- Which configuration achieves the best performance? Which metrics is that judgment based on?
- Can you extend the best configuration (e.g., with n-grams or a custom tokenizer)?
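
To answer the first question without reading every report by eye, one option is to collect a single summary number per configuration. The sketch below assumes accuracy and macro-averaged F1 as the metrics of interest; `summarize` is a hypothetical helper, not part of the course code, and the `toy_*` data is made up so the example runs on its own.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


def summarize(configs, train_texts, train_labels, test_texts, test_labels):
    """Return (description, accuracy, macro F1) tuples, best macro F1 first."""
    rows = []
    for description, vectorizer, classifier in configs:
        features_train = vectorizer.fit_transform(train_texts)
        features_test = vectorizer.transform(test_texts)
        classifier.fit(features_train, train_labels)
        predicted = classifier.predict(features_test)
        accuracy = metrics.accuracy_score(test_labels, predicted)
        # Macro F1 weights every class equally, regardless of its support
        macro_f1 = metrics.f1_score(
            test_labels, predicted, average="macro", zero_division=0
        )
        rows.append((description, accuracy, macro_f1))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up toy data, only to make the sketch runnable
toy_train_texts = [
    "shares fell as the bank reported lower profit",
    "the company announced record profit and new shares",
    "investors sold shares after the profit warning",
    "the minister defended the new law in parliament",
    "parliament will vote on the election law",
    "the election campaign divided the parliament",
]
toy_train_labels = ["business"] * 3 + ["politics"] * 3
toy_test_texts = ["the bank sold shares", "the minister won the election"]
toy_test_labels = ["business", "politics"]

toy_configs = [
    ("NB with Count", CountVectorizer(), MultinomialNB()),
    ("LogReg with TfIdf", TfidfVectorizer(), LogisticRegression(solver="liblinear")),
]

toy_rows = summarize(
    toy_configs, toy_train_texts, toy_train_labels, toy_test_texts, toy_test_labels
)
for description, accuracy, macro_f1 in toy_rows:
    print(f"{description:<20} accuracy={accuracy:.2f} macro F1={macro_f1:.2f}")
```

Macro F1 is used for sorting here because the classes in the test set are unbalanced (400 of the 698 test items are `entertainment`), so accuracy alone can hide weak performance on the smaller classes.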

In [12]:
configurations = [('NB with Count', CountVectorizer(min_df=5, max_df=.5), MultinomialNB()),
                 ('NB with TfIdf', TfidfVectorizer(min_df=5, max_df=.5), MultinomialNB()),
                 ('LogReg with Count', CountVectorizer(min_df=5, max_df=.5), LogisticRegression(solver='liblinear')),
                 ('LogReg with TfIdf', TfidfVectorizer(min_df=5, max_df=.5), LogisticRegression(solver='liblinear')),
                 ('SVM with Count - rbf kernel', CountVectorizer(min_df=5, max_df=.5), SVC(kernel='rbf')),
                 ('SVM with Count - linear kernel', CountVectorizer(min_df=5, max_df=.5), SVC(kernel='linear')),
                 ('SVM with Tfidf - rbf kernel', TfidfVectorizer(min_df=5, max_df=.5), SVC(kernel='rbf')),
                 ('SVM with Tfidf - linear kernel', TfidfVectorizer(min_df=5, max_df=.5), SVC(kernel='linear')),

                 ]

for description, vectorizer, classifier in configurations:
    print(description)
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    classifier.fit(X_tr, y_train)
    y_pred = classifier.predict(X_te)
    print(metrics.classification_report(y_test, y_pred))
    print('\n')

NB with Count
               precision    recall  f1-score   support

     business       0.48      0.76      0.59       101
entertainment       0.95      0.73      0.83       400
        other       0.52      0.58      0.55        73
     politics       0.71      0.85      0.78       124

     accuracy                           0.74       698
    macro avg       0.66      0.73      0.68       698
 weighted avg       0.79      0.74      0.75       698



NB with TfIdf
               precision    recall  f1-score   support

     business       0.80      0.16      0.26       101
entertainment       0.67      0.99      0.80       400
        other       1.00      0.11      0.20        73
     politics       0.90      0.53      0.67       124

     accuracy                           0.70       698
    macro avg       0.84      0.45      0.48       698
 weighted avg       0.76      0.70      0.64       698



LogReg with Count
               precision    recall  f1-score   support

     bus

In [19]:
# try with bigrams
configurations = [
    ('LogReg with Count bigrams',
     CountVectorizer(min_df=5, max_df=0.5, ngram_range=(1,2)),
     LogisticRegression(solver='liblinear')),

    ('SVM with Count bigrams - linear kernel',
     CountVectorizer(min_df=5, max_df=0.5, ngram_range=(1,2)),
     SVC(kernel='linear'))
]

# Reload the data to ensure X_train and X_test contain the original text
X_test, y_test = get_data('test')
X_train, y_train = get_data('train')


for description, vectorizer, classifier in configurations:
    print(description)
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    classifier.fit(X_tr, y_train)
    y_pred = classifier.predict(X_te)
    print(metrics.classification_report(y_test, y_pred))
    print('\n')

LogReg with Count bigrams
               precision    recall  f1-score   support

     business       0.62      0.56      0.59       101
entertainment       0.82      0.94      0.87       400
        other       0.73      0.48      0.58        73
     politics       0.90      0.71      0.79       124

     accuracy                           0.80       698
    macro avg       0.77      0.67      0.71       698
 weighted avg       0.79      0.80      0.79       698



SVM with Count bigrams - linear kernel
               precision    recall  f1-score   support

     business       0.56      0.56      0.56       101
entertainment       0.81      0.91      0.86       400
        other       0.67      0.51      0.58        73
     politics       0.89      0.69      0.78       124

     accuracy                           0.78       698
    macro avg       0.73      0.67      0.69       698
 weighted avg       0.78      0.78      0.77       698
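
The reports above show lower recall for `business` and `other` than for `entertainment`. A confusion matrix is one way to see which categories those items are mistaken for. The sketch below uses made-up `toy_true` and `toy_pred` label lists purely for illustration; with the real data, `y_test` and `y_pred` from the loop above would take their place.

```python
from sklearn import metrics

category_labels = ["business", "entertainment", "other", "politics"]

# Made-up labels for illustration only
toy_true = ["business", "business", "entertainment", "entertainment", "other", "politics"]
toy_pred = ["business", "entertainment", "entertainment", "entertainment", "entertainment", "politics"]

# Rows are the true categories, columns the predicted ones,
# both in the order given by category_labels.
toy_cm = metrics.confusion_matrix(toy_true, toy_pred, labels=category_labels)

for label, row in zip(category_labels, toy_cm):
    print(f"{label:>13}", row)
```

Off-diagonal counts in a row show where items of that true category end up; in this toy example one `business` item and the single `other` item are both predicted as `entertainment`.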



