# A Simple Machine Learning Workflow

This notebook will guide you through a basic supervised machine learning workflow.

> 💡 If you have brought your own dataset, try applying these steps to it at the end of this notebook.

<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/01_preprocessing.png?raw=1" alt="Preprocessing diagram" style="max-width: 150px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/02_enriching.png?raw=1" alt="Enriching diagram" style="max-width: 150px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/03_vectorization.png?raw=1" alt="Vectorization diagram" style="max-width: 150px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/05_modelling.png?raw=1" alt="Modelling diagram" style="max-width: 150px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/06_evaluation.png?raw=1" alt="Evaluation diagram" style="max-width: 150px;">


You can **choose** to work with either of the following datasets:

- The [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) - a collection of 50,000 labeled IMDB reviews for binary sentiment classification.

- The [NELA-PS dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YHWTFC) -  a collection of 'pink slime' partisan news articles (from which we will use a small sample).

Your **goal** is to train a basic supervised machine learning model to classify text for these datasets:
- For the Large Movie Review Dataset, the objective is to **predict whether a movie review is positive or negative**.
- For the NELA-PS dataset, the objective is to **predict the outlet that published a given article**.

You will do this by training the model on the 'train' subset of the dataset, and then evaluating its performance on the 'test' subset.

> ⚠️ Remember: the pre-processing, enriching and vectorization steps can be applied in many different ways. Experimentation is very important. On Day 4, you will learn how to systematically perform this experimentation with different choices.

> 💡 The eagle-eyed among you will notice that this workbook does not include the Dimensionality Reduction and Clustering 'lego brick' of the course pipeline. That is because here we train a simple model on an equally simple Bag-of-Words representation of the text data, so that step is not required (yet). Tomorrow (Day 3), we will explore Dimensionality Reduction and Clustering in detail.


# 0. Setup

In [4]:
%pip install scikit-learn



In [6]:
%pip install spacy



In [7]:
#packages you have seen already -
import os #for os operations
from glob import glob #for filepath operations
import pandas as pd #for dataframes
import numpy as np #for numerical operations

#some new packages -
import re #for regex operations (used in pre-processing)
import spacy #for the 'enrich step' - Part-of-Speech tagging and Named Entity Recognition (NER)

#packages required for our ML pipeline
from sklearn.preprocessing import LabelEncoder #for encoding labels as numbers
from sklearn.model_selection import train_test_split #for splitting data into training and test sets
from sklearn.feature_extraction.text import CountVectorizer #a vectorizer (converts text to numbers)
from sklearn.naive_bayes import MultinomialNB #the model (a Naive Bayes classifier)
from sklearn.metrics import confusion_matrix, classification_report #for model evaluation

# 1. Preprocessing
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/01_preprocessing.png?raw=1" alt="Preprocessing diagram" style="max-width: 150px;">

In this step, we will load in the data, explore it, and clean it up a bit.

> 💡 As you saw this morning, vectorization (which comes after pre-processing) can take care of a lot of pre-processing for you (e.g., lowercasing, stopword removal and, with a custom tokenizer, lemmatization). However, there are two reasons why pre-processing remains important:
- The first is an **input reason**: your text might not be very 'clean' to begin with. It might contain non-Unicode characters or HTML tags (e.g. `</br>`) that will confuse the vectorizer.
- The second is an **output reason**: perhaps you want something different from, or more targeted than, a verbatim copy of the text, for theoretical reasons or because of your research question.

⚠️ Therefore, think carefully about what pre-processing steps you want to apply, and why. You can always come back and change this part of the pipeline at any time. Do experiment.
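
To make the 'input reason' concrete, here is a minimal sketch of a cleaning helper that uses only the Python standard library. The function name `clean_text` and the sample string are made up for illustration, and the tag pattern is deliberately simple; it is not a full HTML parser.

```python
import html
import re

# Simple pattern for anything that looks like an HTML tag (an assumption:
# good enough for stray <br /> tags, not for arbitrary HTML documents).
TAG_RE = re.compile(r"<[^>]+>")

def clean_text(raw: str) -> str:
    """Hypothetical helper: unescape entities, drop tags, tidy whitespace."""
    text = html.unescape(raw)        # e.g. "&#39;" becomes "'"
    text = TAG_RE.sub(" ", text)     # replace tags with a space
    text = " ".join(text.split())    # collapse newlines, tabs, repeated spaces
    return text.lower()

sample = "Great film!<br />I&#39;d watch it   again.\n"
print(clean_text(sample))
```

Replacing tags with a space (rather than nothing) avoids gluing together the words on either side of a tag.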


In [9]:
# First, upload the data to Colab
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
   print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))


In [10]:
#first, let's define where this dataset is stored on your computer:
#datadir = '/Users/rupertkiddle/Downloads/'

#load the data into a df using .csv:
df_nela = pd.read_csv(os.path.join("/content", 'nela_data.csv'))

df_nela.head(3) #preview the first three rows of the dataframe

Unnamed: 0,date,outlet,text
0,2018-07-14,BBC,'Mohamed Salah could be like Pele'\n\nThe play...
1,2018-07-14,BBC,Record numbers of women are standing for elect...
2,2018-07-14,BBC,Michael is one of the few African men to have ...


In [11]:
#MISSING DATA -
#i.e., check if any rows have missing data, and remove them if so.
print(df_nela.isnull().sum()) #check how many missing values there are in each column

#remove rows with missing values
df_nela = df_nela.dropna()

date      0
outlet    0
text      1
dtype: int64
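
One detail worth knowing after `dropna()`: the DataFrame keeps its original index labels, so a lookup such as `df['text'][0]` searches for the *label* 0 rather than the first remaining row. The sketch below uses a tiny made-up DataFrame (not the NELA data) to show positional access with `.iloc` and the effect of `reset_index`.

```python
import pandas as pd

# Tiny illustrative frame: the first row has a missing text value.
df = pd.DataFrame({"text": [None, "first real row", "second real row"]})

df = df.dropna()
print(df.index.tolist())        # the label 0 is gone, the gap stays in the index

# Positional access always returns the first remaining row:
print(df["text"].iloc[0])

# Optionally renumber the rows so that labels and positions agree again:
df = df.reset_index(drop=True)
print(df["text"][0])
```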


In [12]:
#print first row of text:
print(df_nela['text'][0])

'Mohamed Salah could be like Pele'

The player's first coach says he could become the best footballer in world.


In [13]:
#TEXT NORMALIZATION -
#i.e., ensuring the text is in a standard format.

#Let's inspect the text before and after each cleaning step:
print(f"first row of text before cleaning:\n{df_nela['text'][0]}")

#first, let's begin by normalizing the text - lowercasing, stripping leading/trailing whitespace.
df_nela['text'] = df_nela['text'].str.lower().str.strip()
print(f"first row of text after lowercasing, stripping whitespace:\n{df_nela['text'][0]}")

#then, let's collapse any internal whitespace (e.g., new lines, tabs, multiple spaces) into single spaces
df_nela['text'] = df_nela['text'].apply(lambda s: ' '.join(str(s).split()))

#We can also remove any non-UTF-8 characters with the following line:
df_nela['text'] = df_nela['text'].apply(lambda x: x.encode('utf-8', 'ignore').decode('utf-8'))
print(f"first row of text after removing non-UTF-8 characters:\n{df_nela['text'][0]}")

#Let's also remove any HTML tags from the text:
#regex pattern: starts with < and ends with >; anything in between (.*), matched non-greedily (?)
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
df_nela['text'] = df_nela['text'].apply(remove_html_tags)
print(f"first row of text after removing html tags:\n{df_nela['text'][0]}")


first row of text before cleaning:
'Mohamed Salah could be like Pele'

The player's first coach says he could become the best footballer in world.
first row of text after lowercasing, stripping whitespace:
'mohamed salah could be like pele'

the player's first coach says he could become the best footballer in world.
first row of text after removing non-UTF-8 characters:
'mohamed salah could be like pele' the player's first coach says he could become the best footballer in world.
first row of text after removing html tags:
'mohamed salah could be like pele' the player's first coach says he could become the best footballer in world.


In [14]:
#PREVENTING LABEL LEAKAGE -
#i.e., ensuring that the text does not contain the label we are trying to predict.

#first, let's get the unique labels in the dataset:
print(df_nela['outlet'].unique())
#we can see that the labels are the names of the news outlets.

#print how many articles mention each outlet name in the text column:
for outlet in df_nela['outlet'].unique():
    count = df_nela['text'].str.contains(outlet.lower()).sum()
    print(f"{outlet}: {count} instances in text")

#Let's remove those outlet names from the text:
for outlet in df_nela['outlet'].unique():
    df_nela['text'] = df_nela['text'].str.replace(outlet.lower(), '')

['BBC' 'The Guardian' 'Infowars' 'CNN' 'Vox']
BBC: 727 instances in text
The Guardian: 439 instances in text
Infowars: 381 instances in text
CNN: 2266 instances in text
Vox: 728 instances in text
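
A plain substring replacement also removes outlet names that occur *inside* other words. As a hedged alternative, the sketch below matches outlet names only as whole words, using `re.escape` and word boundaries. The helper name `mask_outlets` and the example sentence are made up for illustration.

```python
import re

outlets = ["BBC", "The Guardian", "Infowars", "CNN", "Vox"]

# One pattern for all outlet names, matched only at word boundaries (\b).
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(o.lower()) for o in outlets) + r")\b"
)

def mask_outlets(text: str) -> str:
    """Hypothetical helper: drop whole-word outlet names, tidy whitespace."""
    return " ".join(pattern.sub(" ", text).split())

sentence = "a voxel grid rendered for cnn viewers"
print(mask_outlets(sentence))          # whole-word removal keeps "voxel"
print(sentence.replace("vox", ""))     # substring removal damages "voxel"
```

Whether whole-word matching is the right choice depends on your data; URLs such as `cnn.com` would, for example, still lose the `cnn` part.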


# 2. Enriching
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/02_enriching.png?raw=1" alt="Enriching diagram" style="max-width: 150px;">

In this step, we will 'tag' our tokens with additional information, using Part-of-Speech (POS) tagging and Named Entity Recognition (NER).

> 💡 This step is not strictly necessary. If you are short on time, skip it, and come back later.

In [53]:
#PERFORMING PoS TAGGING and NER -

#load the spaCy model (if it is not installed yet, run `!python -m spacy download en_core_web_sm` first)
nlp = spacy.load('en_core_web_sm')

# function returns a tuple: (pos_tags_list, entities_list)
def enrich_text(text):
    doc = nlp(text)
    pos_tags = [token.pos_ for token in doc]
    # store entities as tuples of (text, label) for clarity
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities


#since this is just for illustration, let's just process the first 10 rows.
df_nela_sample5 = df_nela.head(10).copy()

#apply to the 'text' column, and create two new columns: 'pos_tags' and 'entities'
df_nela_sample5[['pos_tags', 'entities']] = df_nela_sample5['text'].apply(lambda x: pd.Series(enrich_text(x)))


In [54]:
#EXPLORING THE PoS TAGS and NER -
#NOTE: the above code added two new columns (pos_tags and entities)

#let's take a look at the output:
print(df_nela_sample5[['text', 'pos_tags', 'entities']])

                                                text  \
0  'Mohamed Salah could be like Pele'\n\nThe play...   
1  Record numbers of women are standing for elect...   
2  Michael is one of the few African men to have ...   
3  The 12 Thai boys rescued after spending 17 day...   
4  Islamic officials in Malaysia have launched an...   
5  The prime minister has warned Conservative par...   
6  Two of the most monstrous regimes in human his...   
7  After two months apart, a Honduran mother and ...   
8  The political operative Roger Stone has admitt...   
9  For five years, Beck Dorey-Stein was a stenogr...   

                                            pos_tags  \
0  [PUNCT, PROPN, PROPN, AUX, AUX, ADP, PROPN, PA...   
1  [ADJ, NOUN, ADP, NOUN, AUX, VERB, ADP, NOUN, A...   
2  [PROPN, AUX, NUM, ADP, DET, ADJ, ADJ, NOUN, PA...   
3  [DET, NUM, PROPN, NOUN, VERB, ADP, VERB, NUM, ...   
4  [ADJ, NOUN, ADP, PROPN, AUX, VERB, DET, NOUN, ...   
5  [DET, ADJ, NOUN, AUX, VERB, ADJ, NOUN, NOUN,

> 💡 `POS-tagging` is very useful if you want to identify certain types of phrases within your texts (e.g., ADJ+NOUN like "angry protestors").

> 💡 `Named Entity Recognition` is useful if you want to identify mentions of specific entities (e.g., people, organisations, locations, etc).
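
As a small illustration of the phrase-matching idea, the sketch below extracts ADJ+NOUN pairs from a list of `(token, tag)` tuples. The tags here are written by hand as a stand-in for what spaCy's `token.pos_` would return, so that the example runs without a language model; `adj_noun_pairs` is a hypothetical helper name.

```python
# Hand-written (token, POS tag) pairs, standing in for spaCy output.
tagged = [
    ("angry", "ADJ"), ("protestors", "NOUN"), ("marched", "VERB"),
    ("through", "ADP"), ("the", "DET"), ("old", "ADJ"), ("town", "NOUN"),
]

def adj_noun_pairs(tagged_tokens):
    """Return every adjective that is directly followed by a noun."""
    pairs = []
    # zip the list with itself shifted by one to look at neighbouring tokens
    for (word1, tag1), (word2, tag2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1 == "ADJ" and tag2 == "NOUN":
            pairs.append(f"{word1} {word2}")
    return pairs

print(adj_noun_pairs(tagged))
```

With real spaCy output, the same function could be fed `[(t.text, t.pos_) for t in doc]`.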

# 3. Vectorization
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/03_vectorization.png?raw=1" alt="Vectorization diagram" style="max-width: 150px;">

In this step, we will convert our text data into a numerical format that can be used by machine learning algorithms.

In [18]:
#OPTION 1: COUNT VECTORIZER -

#Let's define a CountVectorizer with some parameters (you can experiment with these):
vectorizer_CV = CountVectorizer(lowercase=False, stop_words='english', ngram_range=(1, 2), min_df=5, max_df=0.8)

#fit the vectorizer to the cleaned text, and transform the text to a document-term matrix
X_CV = vectorizer_CV.fit_transform(df_nela['text'])

#print the number of terms (i.e., columns) in the document-term matrix
print(f"CountVectorizer - number of features: {len(vectorizer_CV.get_feature_names_out())}")

#print the number of documents (i.e., rows) in the document-term matrix:
print(f"CountVectorizer - number of documents: {X_CV.shape[0]}")

CountVectorizer - number of features: 105393
CountVectorizer - number of documents: 9999


In [19]:
#OPTION 2: TF-IDF VECTORIZER -

#Let's define a TfidfVectorizer with some parameters (you can experiment with these):
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_TFIDF = TfidfVectorizer(lowercase=False, stop_words='english', ngram_range=(1, 2), min_df=5, max_df=0.8)

#fit the vectorizer to the cleaned text, and transform the text to a document-term matrix
X_TFIDF = vectorizer_TFIDF.fit_transform(df_nela['text'])

#print the number of features (i.e., unique tokens) in the document-term matrix
print(f"TfidfVectorizer - number of features: {len(vectorizer_TFIDF.get_feature_names_out())}")

#print the number of documents (i.e., rows) in the document-term matrix:
print(f"TfidfVectorizer - number of documents: {X_TFIDF.shape[0]}")

TfidfVectorizer - number of features: 105393
TfidfVectorizer - number of documents: 9999


In [20]:
#WHAT IS THE DIFFERENCE BETWEEN COUNT VECTORIZER AND TF-IDF VECTORIZER?
#Q: which words have the highest and lowest weights in each document-term matrix, and what are their values?
#NOTE: This is for illustrative purposes, you do not need to understand this code in detail.

def analyze_vectorizer(X, vectorizer, name):
    """Analyze top 5 and bottom 5 words for the first document only"""
    # Convert only the first document (row 0) to a dense array;
    # densifying the whole matrix with X.toarray() can exhaust memory on large corpora
    first_doc_values = X[0].toarray().ravel()

    # Get feature names
    features = vectorizer.get_feature_names_out()

    # Get indices for top 5 and bottom 5 (excluding zeros)
    non_zero_mask = first_doc_values > 0
    non_zero_values = first_doc_values[non_zero_mask]
    non_zero_features = features[non_zero_mask]

    # Sort by values
    sorted_indices = np.argsort(non_zero_values)

    print(f"\n{name} - First Document:")
    print("Top 5 words:")
    for i in range(-1, -6, -1):  # Last 5 (highest)
        idx = sorted_indices[i]
        print(f"  '{non_zero_features[idx]}': {non_zero_values[idx]:.4f}")

    print("Bottom 5 words:")
    for i in range(5):  # First 5 (lowest non-zero)
        idx = sorted_indices[i]
        print(f"  '{non_zero_features[idx]}': {non_zero_values[idx]:.4f}")

# Analyze both vectorizers
analyze_vectorizer(X_CV, vectorizer_CV, "CountVectorizer")
analyze_vectorizer(X_TFIDF, vectorizer_TFIDF, "TF-IDF Vectorizer")

#and print the text of the article for comparison:
print(f"\nText of the first document:\n{df_nela['text'].iloc[0]}")


CountVectorizer - First Document:
Top 5 words:
  'world': 1.0000
  'says': 1.0000
  'salah': 1.0000
  'player': 1.0000
  'mohamed salah': 1.0000
Bottom 5 words:
  'best': 1.0000
  'coach': 1.0000
  'footballer': 1.0000
  'like': 1.0000
  'mohamed': 1.0000

TF-IDF Vectorizer - First Document:
Top 5 words:
  'footballer': 0.4476
  'mohamed salah': 0.4476
  'salah': 0.4098
  'mohamed': 0.4059
  'coach': 0.3158
Bottom 5 words:
  'like': 0.1112
  'says': 0.1473
  'world': 0.1491
  'best': 0.1764
  'player': 0.2817

Text of the first document:
'mohamed salah could be like pele' the player's first coach says he could become the best footballer in world.
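
To see where the TF-IDF weights come from, the sketch below recomputes them by hand on a made-up toy corpus and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n_docs) / (1 + df)) + 1`, followed by L2 normalization of each document row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for this example
docs = ["salah scores again", "coach praises salah", "coach says team is ready"]

# Step 1: raw term counts (same tokenization defaults as TfidfVectorizer)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Step 2: document frequency and smoothed inverse document frequency
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then scale every row to unit (L2) length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with scikit-learn's own result
sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

This is why a term that occurs once can still receive different weights in the TF-IDF output: terms found in fewer documents get a larger IDF factor.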


# 4. Modelling
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/05_modelling.png?raw=1" alt="Modelling diagram" style="max-width: 150px;">

In this step, we will train a simple supervised machine learning model (Multinomial Naive Bayes) on the 'training' data subset.

Remember, we refer to:
- The 'features' as **X** (the input data, i.e., the vectorized text data)
- The 'labels' as **y** (the output data, i.e., the sentiment labels for the movie reviews, or the outlet labels for the news articles)

In [22]:
#We need to encode the labels (i.e., the news outlets) as numbers:
le = LabelEncoder()
y = le.fit_transform(df_nela['outlet']) #encode the labels as numbers

#print what outlets they correspond to:
print(f"Label mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Label mapping: {'BBC': np.int64(0), 'CNN': np.int64(1), 'Infowars': np.int64(2), 'The Guardian': np.int64(3), 'Vox': np.int64(4)}
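
The encoder also works in the other direction, which is handy for turning numeric predictions back into readable outlet names. A minimal sketch on a short made-up label list (the variable names are illustrative, chosen so they do not overwrite `le` and `y` above):

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(["Vox", "BBC", "CNN", "BBC"])

print(toy_encoder.classes_)   # classes are stored in sorted order
print(toy_y)                  # each label replaced by its position in classes_

# inverse_transform maps numbers back to the original labels
decoded = toy_encoder.inverse_transform([0, 2])
print(decoded)
```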


In [29]:
#next, we will split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X_CV, y, test_size=0.2, random_state=42, stratify=y)

#Now everything is ready to train a model!

#We will use a simple Multinomial Naive Bayes classifier:
model = MultinomialNB() #here we instantiate (create) the model so that we can use it.

#fit the model to the training data:
model.fit(X_train, y_train) #giving it the training data (X_train) and the labels (y_train)

#predict the labels for the test data:
#NOTE: we keep the true labels (y_test) aside so that we can evaluate the model in the next step.
y_pred = model.predict(X_test) #predict the labels for the test data (X_test)
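
The `stratify=y` argument used in the split keeps the class proportions the same in the training and test sets. The sketch below shows this on made-up labels with an 80/20 class balance (the numbers are illustrative, not from the datasets in this notebook).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 100 dummy samples: 80 of class 0 and 20 of class 1
toy_X = list(range(100))
toy_labels = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Both subsets keep the 80/20 balance of the full data
print(Counter(y_tr))
print(Counter(y_te))
```

Without stratification, a small or rare class could end up under-represented (or even absent) in the test set purely by chance.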

# 5. Evaluation
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/06_evaluation.png?raw=1" alt="Evaluation diagram" style="max-width: 150px;">

In this step, we will evaluate the performance of our trained model by comparing its predictions (`y_pred`) with the true labels (`y_test`) of the test set.

In [30]:
#first, let's look at the confusion matrix (rows = true labels, columns = predicted labels)
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_test, y_pred), index=le.classes_, columns=le.classes_)
confusion_matrix_df

Unnamed: 0,BBC,CNN,Infowars,The Guardian,Vox
BBC,362,12,11,12,3
CNN,8,317,8,31,36
Infowars,23,33,246,61,37
The Guardian,11,54,13,282,40
Vox,9,40,26,41,284


In [31]:
#the classification report gives us precision, recall, f1-score for each class (i.e., each label)
print(classification_report(y_test, y_pred))

#print our labels again for reference:
print(f"Label mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

#for now, higher = better (we will cover these metrics on Day 4).
#if curious:
#Precision = TP / (TP + FP): of the predicted positives, how many are correct?
#Recall = TP / (TP + FN): of the actual positives, how many did you find?
#F1 = harmonic mean of precision and recall.
#support = number of occurrences of the class (the label) in the test set.

              precision    recall  f1-score   support

           0       0.88      0.91      0.89       400
           1       0.70      0.79      0.74       400
           2       0.81      0.61      0.70       400
           3       0.66      0.70      0.68       400
           4       0.71      0.71      0.71       400

    accuracy                           0.75      2000
   macro avg       0.75      0.75      0.74      2000
weighted avg       0.75      0.75      0.74      2000

Label mapping: {'BBC': np.int64(0), 'CNN': np.int64(1), 'Infowars': np.int64(2), 'The Guardian': np.int64(3), 'Vox': np.int64(4)}


> 💡 How did it perform? Better, or worse, for certain classes (labels)? Consider returning to your pre-processing and vectorization steps, to see the effects of different choices.
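
If you want to check how the report relates to the confusion matrix, the sketch below recomputes precision, recall and F1 per class in plain Python, using the confusion matrix values printed above (rows = true labels, columns = predicted labels). The helper name `per_class_metrics` is made up for illustration.

```python
labels = ["BBC", "CNN", "Infowars", "The Guardian", "Vox"]

# Confusion matrix copied from the output above
cm = [
    [362, 12, 11, 12, 3],
    [8, 317, 8, 31, 36],
    [23, 33, 246, 61, 37],
    [11, 54, 13, 282, 40],
    [9, 40, 26, 41, 284],
]

def per_class_metrics(matrix, k):
    """Precision, recall and F1 for class k, treated as the 'positive' class."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, but truly another class
    fn = sum(matrix[k]) - tp                  # truly k, but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(labels):
    p, r, f = per_class_metrics(cm, k)
    print(f"{name:<13} precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Reading down the Infowars column, for instance, shows few false positives (high precision), while reading across the Infowars row shows many articles assigned to other outlets (lower recall).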

# 6. "All Together Now" (optional)
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/01_preprocessing.png?raw=1" alt="Preprocessing diagram" style="max-width: 120px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/03_vectorization.png?raw=1" alt="Vectorization diagram" style="max-width: 120px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/05_modelling.png?raw=1" alt="Modelling diagram" style="max-width: 120px;">
<img src="https://github.com/annekroon/gesis-machine-learning/blob/main/pictures/lego_stack/06_evaluation.png?raw=1" alt="Evaluation diagram" style="max-width: 120px;">

Sklearn's Pipeline() allows us to combine multiple steps into a single object. This is very useful for automating the process of building and evaluating models, especially when we want to systematically test different configurations (e.g., different pre-processing steps, vectorizers, models, etc).

In [None]:
#import it like this:
from sklearn.pipeline import Pipeline

#we can then create a pipeline with our vectorizer and model:
#NOTE: mouse over the Pipeline() function to see what else you can add.
my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=False, stop_words='english', ngram_range=(1, 2), min_df=5, max_df=0.8)),
    ('model', MultinomialNB())
])

#split the data into training and test sets again:
#NOTE: this time we pass the text itself (df_nela['text']) rather than a document-term matrix; the pipeline does the vectorizing.
X_text = df_nela['text']
X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42, stratify=y)

#fit the pipeline to the training data:
my_pipeline.fit(X_train, y_train)

#predict the labels for the test data:
y_pred = my_pipeline.predict(X_test)

> 💡 That's it! Pipeline just allows us to streamline the process of building and evaluating our model by encapsulating all the steps into a single object. This becomes essential for automatically testing different configurations to see what performs best - a 'grid search' - which we will explore on Day 4.
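
As a preview of what such a search can look like, here is a minimal sketch that combines a pipeline with scikit-learn's `GridSearchCV`. The toy corpus, the parameter values and the choice of `cv=2` are assumptions made so that the example runs quickly; they are not recommendations for real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, invented for this example: 1 = positive, 0 = negative
toy_texts = [
    "great film loved it", "wonderful acting great story",
    "loved the story", "a wonderful film",
    "terrible film hated it", "awful acting boring story",
    "hated the story", "a boring film",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("model", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "model__alpha": [0.1, 1.0],
}

# Try every combination with 2-fold cross-validation
search = GridSearchCV(toy_pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.predict(["a great story"]))
```

Because the vectorizer lives inside the pipeline, it is re-fitted on the training portion of every fold, so the held-out portion never influences the vocabulary.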

# 7. Try on your own data (optional)

In [32]:
# First, upload the data to Colab
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
   print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))

Saving scraped_data-test2.csv to scraped_data-test2.csv
User uploaded file "scraped_data-test2.csv" with length 174816 bytes


In [34]:
#load the data into a df using .csv:
df_tiktok = pd.read_csv(os.path.join("/content", 'scraped_data-test2.csv'))

df_tiktok.head(3) #preview the first three rows of the dataframe

Unnamed: 0,date,link,contentID,duration,authorID,AdContent,AdAccount,diggCount,commentCount,playCount,shareCount,username,description
0,2025-07-22 09:48:11,https://www.tiktokv.com/share/video/7505430917...,7505430917360487702,59.0,7.252552e+18,False,False,71700.0,451.0,721800.0,3069.0,likeme.edits.varo,Beetje laat met deze trend hahaha #kijkenmaar...
1,2025-07-22 09:48:18,https://www.tiktokv.com/share/video/7519052633...,7519052633512873238,47.0,7.476754e+18,False,False,36300.0,1168.0,819500.0,21000.0,pov_sensei,POV: You Are Visiting Brussels In 2050 #brusse...
2,2025-07-22 09:48:32,https://www.tiktokv.com/share/video/7526540252...,7526540252249345312,269.0,7.433888e+18,False,False,31700.0,171.0,323300.0,7628.0,heethoofdcartoons,Heethoofd zoekt geld💸💰 #fyp #voorjou #antwerpe...


In [41]:
# only keep columns that we work with
df_small = df_tiktok[["contentID", "AdContent", "description"]].copy()
df_small.head()



In [42]:
# check the balance of ad content
df_small["AdContent"].value_counts()


Unnamed: 0_level_0,count
AdContent,Unnamed: 1_level_1
False,517
True,25


In [43]:
#MISSING DATA -
#i.e., check if any rows have missing data, and remove them if so.
print(df_small.isnull().sum()) #check how many missing values there are in each column

#remove rows with missing values
df_small = df_small.dropna()

print("Remaining rows:", len(df_small))


contentID      0
AdContent      0
description    0
dtype: int64
Remaining rows: 542


In [49]:
#print first row of text:
print(df_small['description'][1])

POV: You Are Visiting Brussels In 2050 #brussels #belgium #bruxelles #veo3 #aivideo 


In [52]:
import re

# Inspect before cleaning
print("Before cleaning:\n", df_small["description"].iloc[100])

# 1. Lowercase + strip leading/trailing whitespace
df_small["desc_clean"] = df_small["description"].str.lower().str.strip()

# 2. Normalize spaces (remove \n, tabs, extra spaces)
df_small["desc_clean"] = df_small["desc_clean"].apply(lambda s: " ".join(str(s).split()))

# Preview
print("After cleaning:\n", df_small["desc_clean"].iloc[100])

Before cleaning:
 Wyd in this situation? #s1k #bmw #motorcycle #streetbike #bikelife 
After cleaning:
 wyd in this situation? #s1k #bmw #motorcycle #streetbike #bikelife


In [57]:
# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the cleaned descriptions column
vectorizer_TFIDF = TfidfVectorizer(
    lowercase=False,          # lowercase is useful here - but this was already done...
    stop_words="english",    # remove common stopwords
    ngram_range=(1, 2),      # unigrams + bigrams
    min_df=5,                # ignore rare tokens (<5 docs)
    max_df=0.8               # ignore too common tokens (>80% docs)
)

X_TFIDF = vectorizer_TFIDF.fit_transform(df_small["desc_clean"])

print(f"TfidfVectorizer - number of features: {len(vectorizer_TFIDF.get_feature_names_out())}")
print(f"TfidfVectorizer - number of documents: {X_TFIDF.shape[0]}")


TfidfVectorizer - number of features: 169
TfidfVectorizer - number of documents: 542


In [59]:
#Let's define a CountVectorizer with some parameters (you can experiment with these):
vectorizer_CV = CountVectorizer(
    lowercase=False,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.8
  )

#fit the vectorizer to the cleaned descriptions, and transform the text to a document-term matrix
X_CV = vectorizer_CV.fit_transform(df_small['desc_clean'])

#print the number of terms (i.e., columns) in the document-term matrix
print(f"CountVectorizer - number of features: {len(vectorizer_CV.get_feature_names_out())}")

#print the number of documents (i.e., rows) in the document-term matrix:
print(f"CountVectorizer - number of documents: {X_CV.shape[0]}")

CountVectorizer - number of features: 169
CountVectorizer - number of documents: 542


In [72]:
# ---- Analysis function ----
def analyze_vectorizer(X, vectorizer, name):
    """Analyze top 5 and bottom 5 words for the first document only"""
    X_array = X.toarray()
    first_doc_values = X_array[0, :]

    # Feature names
    features = vectorizer.get_feature_names_out()

    # Only keep non-zero terms
    non_zero_mask = first_doc_values > 0
    non_zero_values = first_doc_values[non_zero_mask]
    non_zero_features = features[non_zero_mask]

    if len(non_zero_values) == 0:
        print(f"\n{name} - First document has no non-zero terms.")
        return

    # Sort values
    sorted_indices = np.argsort(non_zero_values)

    print(f"\n{name} - First Document:")
    print("Top words:")
    # Print up to the top 5 words
    for i in range(1, min(6, len(sorted_indices) + 1)):
        idx = sorted_indices[-i]
        print(f"  '{non_zero_features[idx]}': {non_zero_values[idx]:.4f}")

    print("Bottom words:")
    # Print up to the bottom 5 words
    for i in range(min(5, len(sorted_indices))):
        idx = sorted_indices[i]
        print(f"  '{non_zero_features[idx]}': {non_zero_values[idx]:.4f}")

# ---- Run analysis on both ----
analyze_vectorizer(X_CV, vectorizer_CV, "CountVectorizer")
analyze_vectorizer(X_TFIDF, vectorizer_TFIDF, "TF-IDF Vectorizer")

# ---- Print first description for comparison ----
print("\nFirst TikTok description:\n", df_small["desc_clean"].iloc[0])


CountVectorizer - First Document:
Top words:
  'trend': 1.0000
  'met': 1.0000
  'deze': 1.0000
Bottom words:
  'deze': 1.0000
  'met': 1.0000
  'trend': 1.0000

TF-IDF Vectorizer - First Document:
Top words:
  'trend': 0.6381
  'deze': 0.5842
  'met': 0.5015
Bottom words:
  'met': 0.5015
  'deze': 0.5842
  'trend': 0.6381

First TikTok description:
 beetje laat met deze trend hahaha #kijkenmaarnietliken? #kijkenmaarnietliken #pommelienthijs #fyppp #fyppppppp #fypppppppppppppp #virallll #viralllllll #virallllllllllllll #ketnetmeisjes #ketnet #meisjes #cleovanmeisjes #jackencleo #foruu #blowthisupforme #blowthisuptiktok #pommelienpommelien #knokkeheist #knokke #knokkeoff2 #knokkeoff


In [73]:
# encode the labels true or false
le = LabelEncoder()
y = le.fit_transform(df_small["AdContent"])  # encode as 0/1

# Show mapping
print(f"Label mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")

Label mapping: {False: np.int64(0), True: np.int64(1)}


In [86]:
#next, we will split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X_CV, y, test_size=0.2, random_state=42, stratify=y)

#Now everything is ready to train a model!

#We will use a simple Multinomial Naive Bayes classifier:
model = MultinomialNB() #here we instantiate (create) the model so that we can use it.

#fit the model to the training data:
model.fit(X_train, y_train) #giving it the training data (X_train) and the labels (y_train)

#predict the labels for the test data:
#NOTE: we keep the true labels (y_test) aside so that we can evaluate the model in the next step.
y_pred = model.predict(X_test) #predict the labels for the test data (X_test)

In [87]:
# EVALUATE
# Classification report (precision, recall, f1 for each class)
print(classification_report(y_test, y_pred, target_names=le.classes_.astype(str)))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.97      0.99      0.98       104
        True       0.67      0.40      0.50         5

    accuracy                           0.96       109
   macro avg       0.82      0.70      0.74       109
weighted avg       0.96      0.96      0.96       109

Confusion Matrix:
[[103   1]
 [  3   2]]
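
With classes this imbalanced, accuracy alone can be misleading. As a rough point of comparison, the sketch below builds labels with the same class balance as the test split above (104 `False`, 5 `True`) and scores a baseline that always predicts the majority class, using scikit-learn's `DummyClassifier`. The feature matrix is a placeholder of zeros, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same balance as the test split above: 104 x False (0), 5 x True (1)
y_balance = np.array([0] * 104 + [1] * 5)
X_placeholder = np.zeros((len(y_balance), 1))   # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_balance)
y_baseline = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_balance, y_baseline)
baseline_recall = recall_score(y_balance, y_baseline, zero_division=0)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall for the True class: {baseline_recall:.3f}")
```

A model that never flags any ad content already reaches roughly 95% accuracy here, so the per-class recall and F1 for `True` are the more informative numbers in the report above.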
