In [1]:
import sys
from pathlib import Path
import pandas as pd

def add_src_to_path():
    here = Path.cwd().resolve()
    for p in [here, *here.parents]:
        src = p / "src"
        if src.exists():
            sys.path.insert(0, str(src))
            print("Using src:", src)
            return
    raise FileNotFoundError("Could not find a src/ folder up the tree.")

add_src_to_path()

import importlib, paths
importlib.reload(paths)

import splits, features, models, metrics
from splits import load_splits
from data_prep import load_cleaned_cb, load_cleaned_tx

RANDOM_STATE = 42  # global seed for reproducibility

Using src: C:\cbdetect\src


## Introduction

Online communication is central to modern life, yet alongside routine coordination and social support, it often contains toxic language and cyberbullying. Cyberbullying—defined as repeated, intentional harm carried out through digital media—has been linked to psychological distress, especially among adolescents and vulnerable groups (Van Hee et al., 2018). Detecting such harmful language is therefore an urgent challenge for computational social science, psychology, and online safety research.

Early studies approached harmful speech detection with lexicons and shallow classifiers, using features such as profanity lists, n-grams, or part-of-speech tags (Waseem & Hovy, 2016; Davidson et al., 2017). These methods revealed that linguistic cues can successfully distinguish abusive from neutral text, but they also highlighted challenges such as label ambiguity (e.g., “offensive” vs. “hate” vs. “toxic”) and bias (e.g., falsely flagging identity mentions like gay or Muslim) (Dixon et al., 2018; Borkan et al., 2019). With the advent of deep learning, convolutional and recurrent networks improved detection (Agrawal & Awekar, 2018), and transformer-based models such as BERT and RoBERTa achieved state-of-the-art accuracy (Jahan et al., 2021). However, these models are computationally demanding, opaque in interpretation, and prone to replicating social biases, raising concerns about trust and fairness.

An alternative is to focus on transparent, interpretable linguistic cues that map directly to psychological constructs. Research in pragmatics and psycholinguistics shows that profanity, second-person address, negation, intensifiers, politeness markers, and identity terms can function as signals of aggression, directedness, or targeting (Jay & Janschewitz, 2008; Pennebaker et al., 2015). By quantifying such cues, one can capture meaningful differences in communication style, while maintaining interpretability and reproducibility. This perspective shifts attention from maximizing black-box accuracy toward understanding how harmful language is expressed across domains.

To investigate this, I analyze two complementary corpora. The Cyberbullying Classification dataset (AndrewMVD, Twitter-like) is multi-class and approximately balanced across targeted categories (age, gender, religion, ethnicity, other) versus non-bullying. The Toxic Comment Classification dataset (Jigsaw, Wikipedia talk pages) is multi-label, highly imbalanced, and includes overlapping categories such as toxic, obscene, threat, and identity hate. These datasets differ not only in their annotation design (single vs. multiple labels) but also in platform norms (Twitter’s short, informal, direct language vs. Wikipedia’s longer, coordination-oriented discourse). Together they form a natural experiment for examining how task framing and social context shape both cue distributions and model performance.

While prior research has demonstrated the power of deep neural networks, less is known about how interpretable cues behave across different labeling schemes and social platforms, and how such cues relate to psychological theories of aggression, politeness, and group identity. This study addresses that gap by systematically comparing cue distributions, detection outcomes, and bias patterns across the two corpora.

The purpose is to provide an interpretable, reproducible account of harmful language signals and to evaluate how platform norms and annotation schemes affect precision, recall, and fairness. Specifically, I ask: (1) how hostility, directedness, arousal, politeness, and identity cues differ between clean and harmful texts; (2) how multi-class vs. multi-label framing influences performance; (3) how platform norms shape expression and detection; (4) whether neutral identity mentions are over-flagged; and (5) whether severity- and target-oriented cues can be synthesized into a two-dimensional taxonomy.

From these questions, I hypothesize that (H1) tweets will show clearer class separation while Wikipedia comments will favor conservative decisions; (H2) cue distributions will mirror platform norms; (H3) identity mentions will be more often over-flagged in Wikipedia than Twitter; (H4) a severity-by-target taxonomy will yield interpretable class structure without degrading performance; and (H5) combining severity and target cues will improve detection of identity-related abuse.

In [2]:
# load cleaned data
cb_df = load_cleaned_cb("cleaned_cyberbullying.parquet")
tx_df = load_cleaned_tx("cleaned_jigsaw.parquet")

# load exact saved splits
cb_splits = load_splits("cb")  # expects artifacts/cb_splits.json
tx_splits = load_splits("tx")  # expects artifacts/tx_splits.json

print("CB shape:", cb_df.shape)
print("TX shape:", tx_df.shape)

# choose text columns 
def pick_col(df, candidates):
    for c in candidates:
        if c in df.columns:
            return c
    raise KeyError(f"None of these columns found: {candidates}")

# cyberbullying (twitter-like)
cb_text_col_raw  = pick_col(cb_df, ["tweet_text", "text", "tweet_text_clean"])
cb_label_mc_col  = pick_col(cb_df, ["cyberbullying_type"])
cb_label_bin_col = next((c for c in ["is_bullying","any_toxic"] if c in cb_df.columns), None)

# jigsaw (wikipedia talk)
tx_text_col_raw = pick_col(tx_df, ["comment_text", "text"])
TX_LABELS = [c for c in ["toxic","severe_toxic","obscene","threat","insult","identity_hate"] if c in tx_df.columns]
assert len(TX_LABELS) == 6, f"Missing TX label columns, found: {TX_LABELS}"

# quick schema peek
with pd.option_context("display.max_colwidth", 120):
    display(cb_df[[cb_text_col_raw, cb_label_mc_col]].head(3))
    display(tx_df[[tx_text_col_raw] + TX_LABELS].head(3))

# stratified slices (dataframes) just for showing examples now
cb_train = cb_df.iloc[cb_splits["train"]]
cb_dev   = cb_df.iloc[cb_splits["dev"]]
cb_test  = cb_df.iloc[cb_splits["test"]]

tx_train = tx_df.iloc[tx_splits["train"]]
tx_dev   = tx_df.iloc[tx_splits["dev"]]
tx_test  = tx_df.iloc[tx_splits["test"]]

# a few CB examples per multiclass category (1 per class)
print("\n[CB] one example per class:")
classes = cb_df[cb_label_mc_col].astype("category").cat.categories
for cls in classes:
    row = cb_train[cb_train[cb_label_mc_col] == cls].sample(1, random_state=RANDOM_STATE)
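    txt = str(row.iloc[0][cb_text_col_raw])
    # build the snippet outside the f-string: backslashes inside f-string
    # expressions fail before Python 3.12, and '\\n' matched a literal
    # backslash-n rather than a newline
    snippet = txt[:100].replace("\n", " ")
    print(f"  {cls:>18}: {snippet}{'…' if len(txt) > 100 else ''}")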

# a few TX positives per label (up to 2 each, if available)
print("\n[TX] positive examples per label (up to 2):")
for lab in TX_LABELS:
    pos = tx_train[tx_train[lab] == 1]
    n = min(2, len(pos))
    print(f"  {lab}: {n} example(s)")
    if n > 0:
        for txt in pos.sample(n, random_state=RANDOM_STATE)[tx_text_col_raw].tolist():
            print("    -", str(txt)[:100].replace("\n"," "), "…" if len(str(txt))>100 else "")

CB shape: (47590, 7)
TX shape: (159571, 12)


Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was crapilicious! #mkr",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImACelebrityAU #today #sunrise #studio10 #Neighbours #WonderlandTen #etc,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red velvet cupcakes?,not_cyberbullying


Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, jus...",0,0,0,0,0,0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and ...",0,0,0,0,0,0



[CB] one example per class:
                 age: are ppl making these tweets bc they bullied other girls in high school or bc they still want to be l…
           ethnicity: @AngryAngelsFan @KarenNSports @britjohnsonnnn Fuck LA dumb niggers
              gender: @slasher48 This guy is unreal. "I'm not homophobic, but... LOL haha gay rape joke bc Sam is a b****.…
   not_cyberbullying: Well that's MKR done for the year. Seriously, does anyone watch past the instant restaurant rounds? …
  other_cyberbullying: @garethnelson are you seriously defending software piracy?
            religion: @MaxBlumenthal @veganforareason @brendlewhat @dhere A Muslim woman talks about sex slaves while Max …

[TX] positive examples per label (up to 2):
  toxic: 2 example(s)
    - foot fetishes are awesome fuck you 68.228.72.192 
    - GO FOR IT SHITBAG   ENJOY JACKING YOUR 2 INCH DICK OFF WHILE YOU PRESS THE BUTTON AS WELL. 
  severe_toxic: 2 example(s)
    - F*** OFF, YOU F***ING B****! I AM TELLING THE TR

## Methods

### Data & splits

I use two public corpora with distinct label schemes and platform norms. **Cyberbullying Classification** (Kaggle; curated by Andrew MVD) [(link here)](https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification) contains >47k Twitter-like posts labeled into six categories—age, ethnicity, gender, religion, other_cyberbullying, and not_cyberbullying—and was intentionally balanced to ~8k items per class. Because examples may either narrate bullying or constitute it, I include a content warning for readers inspecting raw text. This dataset is suitable for single-label multiclass modeling and provides a relatively even class mix that reduces variance during training and evaluation.

The **Toxic Comment Classification Challenge** corpus (Kaggle; Jigsaw/Conversation AI, Google) [(link here)](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) consists of Wikipedia talk-page comments labeled by human raters for six (non-exclusive) toxicity types: toxic, severe_toxic, obscene, threat, insult, identity_hate. It is inherently multilabel and highly imbalanced (in the released training data there are 159,571 comments; the majority are clean, while positives are concentrated in toxic/obscene/insult and are rare for threat and identity_hate), which makes thresholding and calibration particularly important later on.

To ensure comparability and prevent leakage, I freeze 70/10/20 train/dev/test splits for each corpus and save them as JSON lists of row indices under artifacts/ (cb_splits.json and tx_splits.json). Stratification respects the label structure of each dataset: for Cyberbullying I stratify by cyberbullying type so the class mixture is stable across train/dev/test; for Toxic Comments I stratify by a coarse label-count bin (0, 1, ≥2 positive labels) to preserve the clean-vs-labeled ratio without peeking at per-label prevalences. These choices produce splits that mirror each corpus’s operational constraints (balanced multiclass vs. imbalanced multilabel) while keeping the test set untouched for final evaluation.
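The split logic described above can be sketched in a few lines. This is a minimal standard-library illustration, not the code in `splits.py`: the helper names `label_count_bin` and `stratified_split` and the toy rows are assumptions made for the example.

```python
import json
import random
from collections import defaultdict

def label_count_bin(row_labels):
    """Coarse stratum for a multilabel row: 0, 1, or 2 (meaning >= 2 positives)."""
    return min(sum(row_labels), 2)

def stratified_split(strata, fractions=(0.7, 0.1, 0.2), seed=42):
    """Split row indices into train/dev/test, stratum by stratum."""
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    rng = random.Random(seed)
    out = {"train": [], "dev": [], "test": []}
    for stratum in sorted(by_stratum):
        idxs = by_stratum[stratum]
        rng.shuffle(idxs)
        n_train = int(round(fractions[0] * len(idxs)))
        n_dev = int(round(fractions[1] * len(idxs)))
        out["train"] += idxs[:n_train]
        out["dev"] += idxs[n_train:n_train + n_dev]
        out["test"] += idxs[n_train + n_dev:]  # remainder goes to test
    return {name: sorted(idxs) for name, idxs in out.items()}

# Toy multilabel rows: 80 clean, 10 with one label, 10 with two labels.
rows = [[0, 0, 0]] * 80 + [[1, 0, 0]] * 10 + [[1, 1, 0]] * 10
splits = stratified_split([label_count_bin(r) for r in rows])

# The frozen splits are plain lists of ints, so they serialize directly to JSON.
payload = json.dumps(splits)
```

For the multiclass corpus the same function would take the class name of each row as its stratum instead of a label-count bin.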

### Preprocessing

I normalize each corpus just enough to reduce noise while preserving signals of hostility and directedness.

**Cyberbullying (Tweets)** 
Mentions are pervasive and often carry no linguistic content beyond who is addressed. I therefore normalize every user mention to @user so specific handles do not leak into the model, and I drop tweets that become empty after removing mentions (20 cases). Tweets that are URL-only provide no lexical cues, so I remove 82 such rows; by contrast, I keep hashtag-only tweets because hashtags frequently encode slurs, targets, or rallying phrases that are informative. Text is lowercased; I remove stopwords but explicitly retain negations by defining negations = {no, not, nor, never} and using an NLTK stoplist with those four removed. I lemmatize (WordNetLemmatizer) to collapse inflections in short, noisy tweets (e.g., insulting → insult), which helps align variants of the same abusive term. I also create a binary helper label is_bullying for analyses where the task is bullying vs. not (is_bullying = 1 if cyberbullying_type != "not_cyberbullying"). This pipeline removes content-free artifacts while retaining exactly the cues we care about—negation, second-person address, punctuation—that mark directed hostility.
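A rough sketch of the tweet normalization steps follows. The regular expressions and the tiny stoplist are illustrative assumptions (the notebook uses NLTK's English stoplist and WordNetLemmatizer, which need downloaded corpora and are left out here so the example runs on its own).

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z0-9_@#']+")

NEGATIONS = {"no", "not", "nor", "never"}
# Toy stand-in for the NLTK stoplist (assumption for this sketch only).
TOY_STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
                 "no", "not", "nor", "never"}
STOPLIST = TOY_STOPWORDS - NEGATIONS  # negations are explicitly retained

def clean_tweet(text):
    """Drop URLs, map every handle to @user, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub("@user", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize_tweet(text):
    """Lowercase, tokenize, and remove stopwords while keeping negations."""
    tokens = TOKEN_RE.findall(clean_tweet(text).lower())
    return [t for t in tokens if t not in STOPLIST]

def is_content_free(text):
    """True when nothing is left once URLs and mentions are removed."""
    return MENTION_RE.sub(" ", URL_RE.sub(" ", text)).strip() == ""
```

With these helpers, `is_content_free` identifies the URL-only and mention-only rows to drop, while a hashtag-only tweet is kept because its text survives both substitutions.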

**Toxic Comments (Wikipedia)**
The Jigsaw corpus is multilabel and highly imbalanced (159,571 comments; ~89.8% clean, ~4.0% with exactly one label, ~6.2% with multiple). To stay fast and transparent on longer, more grammatical discourse, I apply minimal normalization: lowercase, simple word-regex tokenization, and scikit-learn’s English stopwords; I do not lemmatize to avoid blurring politeness markers, hedges, and aspect that matter in coordination talk. I keep all rows (no social-media–specific drops), and store the cleaned tokens in simple_tokens. For a global view I define any_toxic = (label_count > 0), producing a clean vs. toxic split used for quick contrastive analyses (e.g., top words). For the class-specific view, I use the native multilabel setup and compare each positive class against a clean baseline (label_count = 0), which preserves natural overlap (e.g., a comment can be both toxic and insult). As a sanity check, the top-word patterns align with expectations: clean comments are dominated by workflow/meta vocabulary (e.g., use, information, deletion, help, thank); toxic/insult show profanity and general insults (often in greeting frames like “hi, you’re …”); severe_toxic adds harsher/violent verbs; obscene concentrates sexual/body terms; threat surfaces verbs like die/kill plus moderation terms (block, ban); identity_hate concentrates group-directed slurs alongside general insults. These observations motivated keeping normalization conservative on Wikipedia while focusing stronger normalization (and lemmatization) on the tweet domain.
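The Wikipedia-side normalization and the two analysis views (global `any_toxic` and per-label vs. clean baseline) might look roughly like this. The word regex and the three toy comments are assumptions for illustration; only the column names and the stopword source come from the text above.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

TX_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
WORD_RE = re.compile(r"[a-z']+")  # assumed "simple word regex"

def simple_tokens(text):
    """Lowercase, regex-tokenize, drop scikit-learn English stopwords; no lemmatization."""
    return [t for t in WORD_RE.findall(str(text).lower()) if t not in ENGLISH_STOP_WORDS]

# Toy frame standing in for tx_df.
toy = pd.DataFrame({
    "comment_text": ["Thanks for the help with the article",
                     "You are an idiot",
                     "Idiot, I will find you"],
    "toxic": [0, 1, 1], "severe_toxic": [0, 0, 0], "obscene": [0, 0, 0],
    "threat": [0, 0, 1], "insult": [0, 1, 1], "identity_hate": [0, 0, 0],
})
toy["simple_tokens"] = toy["comment_text"].map(simple_tokens)
toy["label_count"] = toy[TX_LABELS].sum(axis=1)
toy["any_toxic"] = (toy["label_count"] > 0).astype(int)

# Class-specific view: positives of one label against the clean baseline.
clean_baseline = toy[toy["label_count"] == 0]
insult_vs_clean = pd.concat([clean_baseline, toy[toy["insult"] == 1]])
```

Because the per-label view filters on one column at a time, a comment that is both toxic and insult appears in both comparisons, which preserves the natural label overlap.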

### Features

I represent text with two complementary TF–IDF views, words (semantics) and characters (obfuscation/style), and use them as the shared backbone across both corpora. TF–IDF is a good fit for my questions because it is interpretable, sparse, and comparable across tasks: term frequency reflects how often a token or short phrase appears in a comment, while inverse document frequency down-weights routine coordination terms (e.g., “article”, “page”, “thanks” on Wikipedia), letting hostile or targeted language stand out. This makes it easy to inspect what the model uses and to audit identity-term reliance during fairness checks.

**Word TF–IDF**

For semantic cues, I build word unigrams and bigrams. On Cyberbullying (tweets), I feed the vectorizer the lemmatized stream (the short domain benefits from collapsing inflections so variants of the same insult align). On Jigsaw/Wikipedia, I feed the non-lemmatized token stream (simple_tokens) to preserve politeness markers, hedges, and aspect that matter in longer coordination talk. Because the domains and vocabularies differ, I fit one vectorizer per dataset with the same transparent settings: ngram_range=(1, 2), min_df=3 to ignore ultra-rare tokens, and max_features=50,000 to cap memory and runtime. Fitting happens on the training split only; I then transform dev and test with the same fitted vectorizer, so all splits live in an identical feature space and no information leaks forward. In practice, word TF–IDF surfaces slurs, directive patterns (“you are…”, “go back”), and verbs of harm—exactly the content my hypotheses probe (severity, directedness, platform norms).
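As a small illustration of the word-level settings, the sketch below fits a `TfidfVectorizer` with the stated parameters on a toy training list and reuses it on unseen text. The six training sentences are invented for the example; with `min_df=3`, only terms that occur in at least three training documents survive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "you are stupid",
    "you are so stupid",
    "thanks for the edit",
    "you are kind",
    "thanks for the help",
    "thanks for the fix",
]
dev_docs = ["you are rude", "unseen words only"]

word_vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_features=50_000)
X_train = word_vec.fit_transform(train_docs)  # vocabulary and IDF learned on train only
X_dev = word_vec.transform(dev_docs)          # same columns, nothing re-fitted

vocab = sorted(word_vec.vocabulary_)
```

On this toy data the surviving vocabulary contains the bigram "you are" but not "stupid" (two documents only), and a dev comment made entirely of unseen words maps to an all-zero row rather than extending the feature space.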

**Character TF–IDF**

For obfuscation and style, I add character n-grams. Users often mask abusive words (f*ck, idi0t), elongate them (fuuuck), or convey stance via punctuation runs (!!!). Word features can miss these if the token is novel; character features fire on shared subpieces (e.g., fuc, uck, idi, i0t) and capture elongations and punctuation directly. I therefore vectorize the natural character stream: on Cyberbullying I use the cleaned tweet text that removes URLs and normalizes mentions to @user (so specific handles don’t leak) while preserving casing/punctuation; on Jigsaw I use the raw `comment_text` (URLs are less dominant there, and punctuation/elongations are already present). Settings balance detail and noise: ngram_range=(3, 5) (shorter n-grams are too generic; longer ones become brittle), min_df=5 to drop one-off typos, and max_features=30,000 to keep the block lightweight since word TF–IDF carries most semantics. As with words, I fit on train only and reuse the fitted vectorizer for dev/test.
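To make the "shared subpieces" argument concrete, here is a plain-Python sketch of character 3- to 5-grams and their overlap between a word and an obfuscated or elongated variant. The helper names are hypothetical; in the pipeline itself the equivalent extraction would be done by `TfidfVectorizer(analyzer="char", ngram_range=(3, 5))`.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """All contiguous character n-grams of length n_min..n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }

def shared_pieces(a, b):
    """Character n-grams that two strings have in common, sorted."""
    return sorted(char_ngrams(a) & char_ngrams(b))

masked = shared_pieces("idiot", "idi0t")        # digit substitution
elongated = shared_pieces("stupid", "stuuupid")  # vowel elongation
```

A word-level vectorizer treats "idi0t" as an unseen token, whereas the character view still activates the n-gram "idi" learned from "idiot"; the elongated form keeps four pieces in common with its base word.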

**Training protocol and reproducibility**

All vectorizers are fit on the training split only, then applied unchanged to dev/test; this prevents vocabulary/IDF leakage. I maintain separate fitted objects per corpus (Twitter vs. Wikipedia), save them under artifacts/ for reuse in later notebooks, and print the resulting matrix shapes as a sanity check (rows match split sizes; columns match the learned vocabulary size). The combined representation—word + char TF–IDF—supports fair, reproducible evaluation, lets me inspect coefficients for identity-term bias (H3), and gives a shared backbone for comparing precision/recall behavior across multiclass Twitter and multilabel Wikipedia (H1–H2) while remaining fast and transparent.
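The combined representation and the shape sanity check can be sketched as follows. The toy texts are invented, and `min_df=1` is used only so that the tiny example keeps a non-empty vocabulary; the notebook's settings are `min_df=3` (words) and `min_df=5` (characters).

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

train = [
    "you are so stupid",
    "thanks for the edit",
    "you are an idi0t!!!",
    "please see the talk page",
]
dev = ["you are stuuupid", "thanks"]

word_vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), min_df=1)

# Fit both views on train only, then stack them column-wise.
X_train = hstack([word_vec.fit_transform(train), char_vec.fit_transform(train)]).tocsr()
X_dev = hstack([word_vec.transform(dev), char_vec.transform(dev)]).tocsr()

n_word, n_char = len(word_vec.vocabulary_), len(char_vec.vocabulary_)
print("train:", X_train.shape, "dev:", X_dev.shape, "word cols:", n_word, "char cols:", n_char)
```

Rows should equal the split size and columns the sum of the two learned vocabularies, for every split. The fitted vectorizers could then be persisted (for example with `joblib.dump`) so later notebooks reuse exactly the same feature space.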

### Models

I use linear models because they match sparse TF–IDF well, train quickly, and expose weights that I can audit. For the cyberbullying binary task (bullying vs. not), I fit a logistic regression with elastic-net regularization (SAGA solver). The L1 part encourages sparsity (clearer feature attributions), while L2 stabilizes estimates under correlated n-grams. For the cyberbullying multiclass task (six categories), I use a linear SVM (LinearSVC): it is robust in high dimensions and yields strong margins without heavy tuning; when I need calibrated probabilities (for plots or operating-point control), I swap to one-vs-rest L2 logistic and calibrate. For the Jigsaw multilabel task (six non-exclusive labels), I train one-vs-rest linear classifiers with SGD (log_loss)—each label gets its own binary model. SGD scales to large, sparse matrices and supports early stopping; it also pairs cleanly with post-hoc probability calibration.

I deliberately avoid black-box architectures here: the research goal is to understand which linguistic cues drive decisions and how behavior changes by domain and label scheme. Linear models give me coefficients I can read (e.g., profanity, second-person, identity terms), which is essential for discussing bias and directedness, and they’re fast enough to re-fit for sensitivity checks (e.g., isotonic vs. sigmoid calibration).

**Thresholding & operating modes**

All models ultimately produce scores (margins or probabilities). Converting scores into flags requires a threshold. Uncalibrated scores are not probabilities, so the same numerical cutoff can behave very differently across labels or domains. I therefore calibrate on the dev split (Platt/sigmoid as the default; isotonic as a sensitivity variant) and then choose thresholds on dev for two operating modes:

- a balanced mode that maximizes F1 (good overall trade-off), and
- a precision-first mode that enforces a minimum precision (e.g., ≥0.95) and then picks the highest recall that satisfies it.

This separation is important under class imbalance (TX): small threshold changes can swing precision/recall dramatically. Calibrated probabilities make the “threshold knob” predictable: increasing the cutoff reliably trades recall for precision, which lets me present clear deployment options (balanced vs. precision-first) rather than a single, opaque score.
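The two operating modes can be expressed as a small threshold search over dev-set scores. The sketch below assumes the scores are already calibrated probabilities; the function names and the ten toy examples are illustrative, not taken from the notebook.

```python
def precision_recall_f1(y_true, scores, thr):
    """Precision, recall and F1 when flagging every score >= thr."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pick_thresholds(y_true, scores, min_precision=0.95):
    """Return (balanced, precision_first) rows of (thr, precision, recall, f1)."""
    rows = [(t, *precision_recall_f1(y_true, scores, t)) for t in sorted(set(scores))]
    balanced = max(rows, key=lambda r: r[3])                   # maximize F1
    eligible = [r for r in rows if r[1] >= min_precision]      # enforce the floor
    precision_first = max(eligible, key=lambda r: r[2]) if eligible else None
    return balanced, precision_first

# Toy dev set: labels and calibrated scores.
y_dev = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
p_dev = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.80, 0.90, 0.95]
balanced, precision_first = pick_thresholds(y_dev, p_dev)
```

On the toy data the balanced mode selects a cutoff of 0.35 (full recall, one false positive), while the precision-first mode moves the cutoff up to 0.65 and gives up one true positive to remove that false alarm. For multilabel data the same search would be run once per label.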

## Results

**Cyberbullying (binary)**

The calibrated elastic-net logistic baseline reaches PR-AUC = 0.983 and ROC-AUC = 0.914. At the F1-optimal threshold (≈ 0.332), performance is F1 = 0.926 with precision = 0.870 and recall = 0.990. A precision-first operating point (threshold ≈ 0.723) achieves precision = 0.953, recall = 0.843, F1 = 0.895; P@10% = 1.00 indicates the top-scored decile is essentially error-free. Feature inspection aligns with the cue hypotheses: profanity, identity terms, generalizers, and negation are associated with the bullying class, while hashtags, URLs, mentions, and exclamations are associated with the non-bullying class; simple style correlates such as token count and average token length also differentiate the classes.
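For readers unfamiliar with P@10%, it is read here as the precision among the 10% of items with the highest scores. A minimal sketch, with a hypothetical helper name and toy data:

```python
def precision_at_fraction(y_true, scores, fraction=0.10):
    """Share of true positives among the top `fraction` of items ranked by score."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

# Toy ranking: 20 items, the 8 highest-scored ones are positive.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [1 if i >= 12 else 0 for i in range(20)]

p_at_10 = precision_at_fraction(toy_labels, toy_scores, fraction=0.10)
p_at_50 = precision_at_fraction(toy_labels, toy_scores, fraction=0.50)
```

A value of 1.00 at 10% means every item in the top-scored decile is a true positive; widening the slice to 50% in the toy example lets two negatives in and lowers the value to 0.8.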

**Cyberbullying (multiclass)**

The multiclass model attains macro-F1 = 0.818. Performance is strongest on age, ethnicity, religion, and gender, and weaker on not_cyberbullying and other_cyberbullying, consistent with the latter categories’ heterogeneity and overlap with targeted classes.

**Toxic Comments (multilabel)**

With OvR linear models and sigmoid calibration, the F1-optimal operating point yields micro-F1 = 0.172 and macro-F1 = 0.224. Per-label PR-AUCs are modest, reflecting class imbalance and subtler discourse structure. Under a precision-floor regime (high precision, fewer false alarms), recall drops as expected, giving micro-F1 ≈ 0.039 and macro-F1 ≈ 0.064.
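Since micro- and macro-averaged F1 can diverge sharply under imbalance, a short sketch of how the two are computed may help. The per-label counts below are invented for illustration and are not results from this study.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts maps label -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn), macro

# Toy counts: a frequent label with many false alarms, a rare label done well.
toy_counts = {"toxic": (20, 150, 30), "threat": (3, 2, 2)}
micro_f1, macro_f1 = micro_macro_f1(toy_counts)
```

Micro-F1 pools the counts, so it is dominated by whichever labels contribute the most predictions; macro-F1 weights every label equally. In the toy case the frequent label's false alarms pull micro-F1 (0.20) below macro-F1 (about 0.39).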

**Calibration sensitivity (Toxic Comments)**

Replacing sigmoid with isotonic calibration produces a small change (micro-F1 = 0.176, macro-F1 = 0.209). The differences fall within the uncertainty bounds, so I treat this as a sensitivity check, not a clear improvement.

**Cross-domain signal (CB → TX, cue-only)**

Cue-only models trained on CB exhibit above-chance discrimination on TX (PR-AUC ≈ 0.12, ROC-AUC ≈ 0.55) without retuning, but zero-shot thresholds yield high recall / low precision. Retuning thresholds on TX dev improves balance slightly, evidencing platform and label-scheme domain shift.
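A cue-only representation can be sketched as a handful of counts per text. The lexicons below are tiny hypothetical stand-ins (the actual cue lists live in the project's `features` module and are not reproduced here); the point is that the same extractor applies unchanged to tweets and talk-page comments.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
CUE_LEXICONS = {
    "second_person": {"you", "your", "you're", "u", "ur"},
    "negation": {"no", "not", "nor", "never"},
    "intensifier": {"very", "so", "really", "totally"},
    "politeness": {"please", "thanks", "thank", "sorry"},
    "profanity": {"damn", "crap"},
}

def cue_features(text):
    """Map a raw text to a small dict of interpretable cue counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = {name: sum(t in lex for t in tokens) for name, lex in CUE_LEXICONS.items()}
    feats["exclamations"] = text.count("!")
    feats["n_tokens"] = len(tokens)
    feats["avg_token_len"] = sum(map(len, tokens)) / len(tokens) if tokens else 0.0
    return feats

tweet_feats = cue_features("You are so not funny, damn!!")
wiki_feats = cue_features("Please see the talk page, thanks.")
```

Because the feature names do not depend on either corpus's vocabulary, a model fitted on CB cue vectors can score TX cue vectors directly; what does not transfer is the threshold, which is why retuning on TX dev matters.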

**Uncertainty**

Approximate 95% confidence intervals on the test set are [0.921, 0.930] for CB binary F1 and [0.206, 0.244] for TX macro-F1.
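The text does not state how these intervals were obtained. One common choice is a percentile bootstrap over test examples, sketched below as an assumption rather than as the method actually used; the function names and toy predictions are illustrative.

```python
import random

def binary_f1(y_true, y_pred):
    """F1 for the positive class from paired label/prediction lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample examples with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy test set: 100 positives (90 caught), 100 negatives (15 false alarms).
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 90 + [0] * 10 + [1] * 15 + [0] * 85
point = binary_f1(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, binary_f1)
```

The interval narrows as the test set grows, which is consistent with the much tighter range reported for the large, balanced CB test split than for the rare-label TX macro-F1.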

**Fairness probe (identity mentions)**

On neutral identity sentences, the positive rate (a proxy for FPR) is higher than the corpus base rate for CB toxic-any (and similarly elevated for TX identity_hate where tested). This indicates sensitivity to identity tokens absent overt hostility. Because neutrality labels can be subjective and context-dependent, these numbers are treated as risk indicators rather than definitive FPR estimates.
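The probe can be sketched as template filling plus a rate comparison. Everything concrete below is a stand-in: the templates, the term list, and especially the toy keyword scorer with made-up weights, which only plays the role of a fitted classifier so the example runs.

```python
import re

TEMPLATES = ["I am {}.", "My neighbour is {}."]   # hypothetical neutral frames
PROBE_TERMS = ["gay", "muslim", "tall"]           # "tall" acts as a non-identity control

# Made-up weights imitating a linear model that has absorbed identity terms.
TOY_WEIGHTS = {"idiot": 2.0, "gay": 0.8, "muslim": 0.6, "you": 0.3}

def toy_flag(text, threshold=0.7):
    """Stand-in classifier: flag when the summed token weights reach the threshold."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens) >= threshold

def positive_rate(texts, flag_fn):
    """Fraction of texts that the classifier flags."""
    return sum(bool(flag_fn(t)) for t in texts) / len(texts)

probe_sentences = [tpl.format(term) for term in PROBE_TERMS for tpl in TEMPLATES]
background = ["thanks for the edit", "you are an idiot", "see the talk page", "nice work"]

probe_rate = positive_rate(probe_sentences, toy_flag)   # proxy for FPR on neutral text
base_rate = positive_rate(background, toy_flag)         # corpus-level flag rate
gap = probe_rate - base_rate
```

A positive gap on sentences that contain no hostility is the risk signal described above. Breaking the rate down per term shows which identity tokens drive it, and those can then be cross-checked against the model's coefficients.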

## Discussion

The results show that interpretable cues + TF–IDF + calibrated linear models are a strong, transparent baseline on the cyberbullying corpus and a serviceable detector on Wikipedia talk. On CB, short, direct messages with concentrated lexical aggression are well captured by word/char TF–IDF; calibration then makes operating-point selection predictable, enabling either a balanced F1 mode or a precision-first mode when false positives are costly. On TX, the combination of multilabel structure and heavy class imbalance makes threshold choice the dominant lever: small changes in calibrated cutoffs swing precision/recall substantially. Here, calibration is not optional: it stabilizes thresholding and makes cutoffs comparable across labels, even though rank-based diagnostics such as PR-AUC change little under monotone calibration.

Cue behavior aligns with psycholinguistic expectations: profanity, second-person address, and negation amplify perceived hostility and directedness; character n-grams catch obfuscation, elongation, and punctuation patterns that word features would miss. The multiclass drop on not_cyberbullying and other_cyberbullying reflects those labels’ heterogeneity and their semantic proximity to the targeted classes; the label set behaves more like a loose taxonomy than a hard partition. On TX, label overlap (e.g., toxic with insult or obscene) and the discourse setting (coordination, policy debate) dilute purely lexical signals, explaining the modest macro-F1 and the benefit of controlling thresholds per label.

Cross-domain analyses confirm partial transfer: cue-only models trained on CB carry some signal into TX but require retuned thresholds to be practical. This underscores that platform norms and annotation schemes are as pivotal as model class. Finally, fairness probing shows the classic tension: identity tokens are informative but risk-bearing. Linear models make this tension visible via coefficients and allow mitigation via thresholding and, if needed, feature adjustments or group-aware audits.

In deployment terms, the takeaways are pragmatic: start with the balanced operating point to maximize coverage; switch to precision-first where moderation cost or user impact warrants fewer false alarms; keep calibration in the loop to ensure thresholds behave consistently; and maintain human-in-the-loop review for flags driven primarily by identity terms. If additional lift is required, the next increments should be light fusion (e.g., add sentence embeddings to the linear stack) and only then a small transformer, balancing gains against complexity and fairness risk.