# Random and majority class baselines

In this notebook we will calculate the macro F1 score for the random classifier and majority classifier baseline models. To briefly explain the models, assuming two class labels 0 and 1:

- In a random classifier, half of the predictions are randomly assigned to 0 and the other half to 1.
- In a majority classifier, all of the predictions are assigned to the class with the largest number of records.
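
To make the two baselines concrete, the sketch below simulates them on synthetic labels using only the standard library. The 80/20 label split, the seed, and the helper names `random_predictions`, `majority_predictions` and `precision_recall` are illustrative assumptions, not part of the datasets or code used later in this notebook.

```python
import random
from collections import Counter
from typing import List, Tuple


def random_predictions(n: int, seed: int = 0) -> List[int]:
    """Assign half of the predictions to 0 and half to 1, in random order."""
    preds = [0] * (n // 2) + [1] * (n - n // 2)
    random.Random(seed).shuffle(preds)
    return preds


def majority_predictions(labels: List[int]) -> List[int]:
    """Predict the most frequent label for every record."""
    majority_label = Counter(labels).most_common(1)[0][0]
    return [majority_label] * len(labels)


def precision_recall(labels: List[int], preds: List[int], positive: int) -> Tuple[float, float]:
    """Precision and recall for the class `positive` (0.0 when undefined)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    predicted = sum(1 for p in preds if p == positive)
    actual = sum(1 for y in labels if y == positive)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Synthetic labels: 80% class 1 and 20% class 0 (an arbitrary illustrative split)
labels = [1] * 8000 + [0] * 2000

rand_prec, rand_rec = precision_recall(labels, random_predictions(len(labels)), positive=1)
maj_prec, maj_rec = precision_recall(labels, majority_predictions(labels), positive=1)

print(f"random:   precision = {rand_prec:.3f}, recall = {rand_rec:.3f}")
print(f"majority: precision = {maj_prec:.3f}, recall = {maj_rec:.3f}")
```

For class 1, the random baseline should land close to a precision equal to the class share and a recall of 0.5, while the majority baseline has a precision equal to the class share and a recall of exactly 1.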

Note that these baselines are not machine learning models: they are not trained, and their predictions depend only on the class proportions. There is therefore nothing to fit and no need for a train/test split, so we calculate the F1 score over the entire dataset.
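
Because the predictions depend only on the class proportions, the per-class scores can be written in closed form. The following is a sketch of that derivation, where the symbol $p$ (the share of records with label 1) and the subscripts are notation introduced here for illustration. The random-classifier values hold in expectation, and the majority-classifier case assumes that class 1 is the majority ($p > \tfrac{1}{2}$).

```latex
% p: share of records with label 1 (notation assumed for this sketch)
\begin{aligned}
\text{Random:}\quad
  & P_1 = p, \quad R_1 = \tfrac{1}{2}
  && \Rightarrow\; F_1^{(1)} = \frac{2 \cdot p \cdot \tfrac{1}{2}}{p + \tfrac{1}{2}} = \frac{2p}{2p + 1} \\
  & P_0 = 1 - p, \quad R_0 = \tfrac{1}{2}
  && \Rightarrow\; F_1^{(0)} = \frac{2 (1 - p) \cdot \tfrac{1}{2}}{(1 - p) + \tfrac{1}{2}} = \frac{2(1 - p)}{3 - 2p} \\[6pt]
\text{Majority:}\quad
  & P_1 = p, \quad R_1 = 1
  && \Rightarrow\; F_1^{(1)} = \frac{2p}{1 + p} \\
  & \text{class 0 is never predicted}
  && \Rightarrow\; F_1^{(0)} = 0 \text{ (by convention)} \\[6pt]
\text{Macro F1:}\quad
  & \tfrac{1}{2}\left(F_1^{(1)} + F_1^{(0)}\right)
\end{aligned}
```

Under these assumptions, the macro F1 of the majority classifier simplifies to $p / (1 + p)$.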

## Calculate metrics

Here we load only one file for each dataset size (medium and large), since all variants of a given size share the same base dataset.

In [1]:
# Define functions for calculating the metrics

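# F1 score: harmonic mean of precision and recall
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.  # avoid division by zero when both are 0
    return 2 * (precision * recall) / (precision + recall)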

# Macro F1 for the random classifier
def random_macro_f1(misinfo_pct: float) -> float:
    # Predictions are independent of the labels, so the precision of each
    # class equals its share of the data and the recall of each class is 0.5
    prec_misinfo = misinfo_pct
    rec_misinfo = 0.5
    f1_misinfo = f1(prec_misinfo, rec_misinfo)
    prec_factual = 1 - misinfo_pct
    rec_factual = 0.5
    f1_factual = f1(prec_factual, rec_factual)
    return (f1_misinfo + f1_factual) / 2

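# Macro F1 for the majority classifier
# (assumes misinformation, label 1, is the majority class)
def majority_macro_f1(misinfo_pct: float) -> float:
    # Every record is predicted as misinformation
    prec_misinfo = misinfo_pct  # share of predictions that are truly misinformation
    rec_misinfo = 1.  # every misinformation record is recovered
    f1_misinfo = f1(prec_misinfo, rec_misinfo)
    f1_factual = 0.  # the factual class is never predicted
    return (f1_misinfo + f1_factual) / 2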

In [2]:
from pathlib import Path
import pandas as pd

# Get datasets
data_dir = Path("../../data/")
data_files = ["mumin_medium-id_trans-indobert_hashtags.csv", "mumin_large-trans-indobert_hashtags.csv"]
mumin_med_df, mumin_large_df = [pd.read_csv(data_dir.joinpath(data_file)) for data_file in data_files]

# Calculate the metrics
misinfo_pcts = {lab: df.label.mean() for lab, df in zip(["medium", "large"], [mumin_med_df, mumin_large_df])}

for size, misinfo_pct in misinfo_pcts.items():
    random_f1 = random_macro_f1(misinfo_pct)
    majority_f1 = majority_macro_f1(misinfo_pct)
    print(f'{size}: random F1 = {100 * random_f1:.2f} and majority F1 = {100 * majority_f1:.2f}')

medium: random F1 = 37.32 and majority F1 = 48.71
large: random F1 = 37.08 and majority F1 = 48.80
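
The majority baseline scoring higher than the random one is a consequence of class imbalance. As a rough illustration, the self-contained sketch below re-implements the same closed-form scores and sweeps over a few hypothetical class shares. The shares in `shares` are illustrative values rather than measurements from the datasets, and `majority_macro_f1` here uses `max(p, 1 - p)` so that it also covers the case where label 0 is the majority, which is an extension of the function defined earlier.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def random_macro_f1(p: float) -> float:
    """Macro F1 of the random baseline when a share p of records has label 1."""
    return (f1(p, 0.5) + f1(1 - p, 0.5)) / 2


def majority_macro_f1(p: float) -> float:
    """Macro F1 of the majority baseline; the minority class scores 0."""
    majority_share = max(p, 1 - p)
    return f1(majority_share, 1.0) / 2


# Hypothetical shares of label-1 records, chosen for illustration only
shares = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

for p in shares:
    print(f"share = {p:.2f}: random F1 = {100 * random_macro_f1(p):.2f}, "
          f"majority F1 = {100 * majority_macro_f1(p):.2f}")
```

With balanced classes the random baseline scores higher, and in this sketch the majority baseline overtakes it somewhere between shares of 0.8 and 0.9. Inverting $p / (1 + p) \approx 0.487$ from the output above suggests a misinformation share of roughly 0.95 in both datasets, which is consistent with the majority baseline coming out ahead.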
