# Stylometric Analysis of Troll Detection

This notebook explores using stylometric features for troll detection. We'll investigate whether simple text statistics and writing style markers can help identify troll-like behavior. This serves as a baseline approach using classical machine learning models.

## Setup and Data Loading

First, we import necessary libraries and load our preprocessed data splits. We'll use various scikit-learn components for building our machine learning pipelines.

In [19]:
# Imports and Setup
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import logging
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import LinearSVR
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Add project root to path
sys.path.append(str(Path.cwd().parent))

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [20]:
# Define paths
DATA_DIR = Path('data')
PROCESSED_DATA_DIR = DATA_DIR / 'processed'
CHECKPOINT_DIR = Path('./checkpoints')

# Create checkpoint directory
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)


In [21]:
# Load preprocessed data splits
train_df = pd.read_parquet(PROCESSED_DATA_DIR / 'train.parquet')
val_df = pd.read_parquet(PROCESSED_DATA_DIR / 'val.parquet')
test_df = pd.read_parquet(PROCESSED_DATA_DIR / 'test.parquet')

print("Dataset sizes:")
print(f"Train: {len(train_df)} samples, {train_df['author'].nunique()} authors")
print(f"Val:   {len(val_df)} samples, {val_df['author'].nunique()} authors")
print(f"Test:  {len(test_df)} samples, {test_df['author'].nunique()} authors")

Dataset sizes:
Train: 625987 samples, 8953 authors
Val:   169654 samples, 1919 authors
Test:  102276 samples, 1919 authors


## English-Only Subset

We filter the dataset to English comments only. This controls for language effects and gives a more focused analysis subset, since stylometric features don't carry over directly between languages.

In [22]:
# Keep English comments only
train_df = train_df[train_df['language'].isin(['en', 'English'])].copy()
val_df = val_df[val_df['language'].isin(['en', 'English'])].copy()
test_df = test_df[test_df['language'].isin(['en', 'English'])].copy()

print("English-only dataset sizes:")
print(f"Training:   {len(train_df):,} samples")
print(f"Validation: {len(val_df):,} samples") 
print(f"Test:       {len(test_df):,} samples")

English-only dataset sizes:
Training:   346,079 samples
Validation: 117,871 samples
Test:       48,815 samples


In [23]:
def make_author_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Returns a DataFrame with one row per author.
    Columns: 'text' (concatenated comments) and 'label' (mean trolliness)
    """
    out = (
        df.groupby('author')['text']
          .apply(lambda x: ' '.join(x))       # concatenate comments
          .to_frame(name='text')
    )
    out['label'] = (
        df.groupby('author')['troll']
          .mean()                             # author-level score
    )
    return out

# Create author-level datasets
train_authors = make_author_dataframe(train_df)
val_authors = make_author_dataframe(val_df)
test_authors = make_author_dataframe(test_df)

## Feature Engineering

We'll extract the following stylometric features from each text:
- Character count: Total number of characters
- Word count: Number of words
- Average word length: Mean length of words
- Capital letters count: Number of uppercase characters
- Number count: Number of numerical digits

These basic text statistics can potentially capture aspects of writing style that differentiate between normal and troll-like behavior.
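
As a quick sanity check of the definitions above, here is a minimal sketch that computes the five statistics for a single string in plain Python. The helper name `stylometric_features` and the sample text are illustrative assumptions, not part of the notebook's pipeline.

```python
def stylometric_features(text: str) -> dict:
    """Compute the five basic text statistics for one string."""
    words = text.split()  # whitespace tokenization, like str.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Guard against empty text so the mean is always defined
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "capital_letters": sum(c.isupper() for c in text),
        "number_count": sum(c.isdigit() for c in text),
    }


sample = "Hello World 42"  # hypothetical example text
print(stylometric_features(sample))
```

Note that the counts include whitespace in `char_count`, and that `number_count` counts individual digits rather than whole numbers.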

In [24]:
def extract_stylometric_features(df):
    """
    Extract stylometric features from text data.
    
    Args:
        df: DataFrame containing 'text' column
        
    Returns:
        DataFrame with stylometric features
    """
    # Create copy to avoid modifying original
    features_df = df.copy()
    
    # Calculate features
    features_df['char_count'] = features_df['text'].str.len()
    features_df['word_count'] = features_df['text'].str.split().str.len()
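    # Mean word length; fall back to 0.0 for empty text so no NaN reaches the models
    features_df['avg_word_length'] = features_df['text'].apply(
        lambda x: np.mean([len(word) for word in x.split()]) if x.split() else 0.0
    )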
    features_df['capital_letters'] = features_df['text'].apply(lambda x: sum(1 for c in x if c.isupper()))
    features_df['number_count'] = features_df['text'].apply(lambda x: sum(c.isdigit() for c in x))
    
    return features_df

# Extract features for each dataset
print("Extracting stylometric features...")
train_features = extract_stylometric_features(train_authors)
val_features = extract_stylometric_features(val_authors)
test_features = extract_stylometric_features(test_authors)

# Display sample of features
print("\nSample of extracted features from training set:")
print(train_features[['text', 'char_count', 'word_count', 'avg_word_length', 'capital_letters', 'number_count']].head())


Extracting stylometric features...

Sample of extracted features from training set:
                                                                                         text  \
author                                                                                          
000780d2cc151286ea3aadb1242e695257fedd4c23  : Ethnic cleansing defined Self-defence zionis...   
00170b5280ca848c4c52f43e5d6dec916e9cccb9a5  Cock blowing woman Look kinky slutty blonde dr...   
001a00e491921d7d9c0f7fd07d3bd6df396a23d0b4  So. Cernovich source story Rep Conyers harassm...   
00327d11dffbb29c24c5ef593d4aa8fc0ea2098a7d  believe God last week got tongue caught roller...   
003b41860dabce4ec6aa9b78bc2f2364f328dc6121  Boeing figuring make jet fuel tobacco Meet NYC...   

                                            char_count  word_count  \
author                                                               
000780d2cc151286ea3aadb1242e695257fedd4c23        2577         376   
00170b5280ca848c4c52f43e5d

## Experiment 1: Gradient Boosting with Stylometric Features

First baseline using only stylometric features:
1. StandardScaler for feature normalization
2. GradientBoostingRegressor with:
   - 400 estimators
   - 0.05 learning rate
   - Max depth of 3
   - 80% subsample rate

In [26]:
# Prepare stylometric feature matrices (used by both the gradient boosting and SVR experiments)
feature_columns = ['char_count', 'word_count', 'avg_word_length', 'capital_letters', 'number_count']

# Create feature matrices
X_train = train_features[feature_columns].values
X_val = val_features[feature_columns].values 
X_test = test_features[feature_columns].values

# Get labels
y_train = train_features['label'].values
y_val = val_features['label'].values
y_test = test_features['label'].values  # author-level frames expose 'label', not 'troll'

In [None]:
model = Pipeline([
    ('scale', StandardScaler()),
    ('gbr',  GradientBoostingRegressor(
                n_estimators=400,
                learning_rate=0.05,
                max_depth=3,
                subsample=0.8,
                random_state=42))
])

model.fit(X_train, y_train)


In [None]:
for name, X, y in [('Train', X_train, y_train),
                   ('Val',   X_val,   y_val),
                   ('Test',  X_test,  y_test)]:
    y_hat = model.predict(X)
    print(f'{name:5s} | MSE {mean_squared_error(y, y_hat):.4f} '
          f'R² {r2_score(y, y_hat):.4f}')

## Experiment 2: Linear SVR with Stylometric Features

Testing Linear Support Vector Regression as an alternative approach:
- StandardScaler for feature normalization
- LinearSVR with C=1.0 and epsilon=0.1

## Author-Level Feature Aggregation

Features are aggregated at the author level because the task is defined as author-level troll detection:
1. Grouping all comments by author
2. Computing stylometric features on each author's concatenated text, with the label set to the author's mean `troll` value
3. Training models on these author-level features
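
A per-comment variant of this aggregation, where features are computed for each comment and then averaged per author, could be sketched as follows. This is an illustrative assumption: it differs from `make_author_dataframe` above, which concatenates an author's comments before features are extracted. The helper name `mean_features_by_author` and the toy comments are hypothetical, and only two of the five features are shown to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def mean_features_by_author(rows):
    """rows: iterable of (author, text) pairs at the comment level.

    Returns {author: {feature_name: mean over that author's comments}}.
    """
    per_author = defaultdict(list)
    for author, text in rows:
        # Per-comment features (a subset of the notebook's five)
        per_author[author].append({
            "char_count": len(text),
            "word_count": len(text.split()),
        })
    # Average each feature over the author's comments
    return {
        author: {name: mean(f[name] for f in feats) for name in feats[0]}
        for author, feats in per_author.items()
    }


# Hypothetical comment-level data
comments = [
    ("a1", "short one"),
    ("a1", "a slightly longer comment"),
    ("a2", "OK"),
]
print(mean_features_by_author(comments))
```

With a pandas comment-level frame, the same idea corresponds to a `groupby('author')` followed by `.mean()` over the feature columns. Averaging keeps features comparable between prolific and occasional authors, whereas counts taken on concatenated text grow with the number of comments.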

In [36]:
# Create and train pipeline
svr_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svr', LinearSVR(C=1.0, epsilon=0.1, random_state=42, max_iter=50000, tol=1e-4))
])

svr_pipe.fit(X_train, y_train)

# Make predictions
train_pred = svr_pipe.predict(X_train)
val_pred = svr_pipe.predict(X_val)
test_pred = svr_pipe.predict(X_test)

# Print results
for split, y_true, y_hat in [
    ("Train", y_train, train_pred),
    ("Val  ", y_val, val_pred), 
    ("Test ", y_test, test_pred)
]:
    print(f"{split} | MSE {mean_squared_error(y_true, y_hat):.4f} "
          f"R² {r2_score(y_true, y_hat):.4f}")

Train | MSE 0.2808 R² -0.2525
Val   | MSE 0.2587 R² -0.2035
Test  | MSE 0.2587 R² -0.2036
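
The negative R² values above indicate that the model fits worse than a constant prediction of the mean label, since R² = 1 - SS_res / SS_tot. A minimal sketch with made-up numbers (the lists below are illustrative, not notebook data) shows how this happens:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.0, 0.0, 1.0, 1.0]      # hypothetical author labels
mean_pred = [0.5] * len(y_true)    # constant baseline: always predict the mean
bad_pred = [1.0, 1.0, 0.0, 0.0]    # predictions that run opposite to the labels

print(r2(y_true, mean_pred))  # 0.0
print(r2(y_true, bad_pred))   # -3.0
```

A score of 0 is therefore the bar any useful regressor has to clear on a given split.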


## Author-Level TF-IDF Analysis

Final experiment using TF-IDF features at the author level:
- Concatenating all comments from each author
- Applying TF-IDF vectorization
- Training Ridge regression on the resulting features
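
To make the vectorization step concrete, here is a minimal plain-Python sketch of TF-IDF weighting on two toy author texts. It follows the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It is a simplification: it uses whitespace tokenization and unigrams only, and omits the bigram, `min_df`, `max_df`, and `max_features` settings. The function name `tfidf` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical concatenated texts for two authors
author_texts = ["fake news fake", "news about weather"]
vectors = tfidf(author_texts)
print(vectors[0])
```

Terms that are frequent for one author but rare across authors (here `fake`) receive the largest weights, which is what the Ridge model then regresses on.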

In [38]:
# Create and train pipeline
author_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50_000,
                              ngram_range=(1, 2),
                              min_df=3,
                              max_df=0.9,
                              dtype=float)),
    ('reg',   Ridge(alpha=1.0, random_state=42))
])

author_pipe.fit(train_authors['text'], train_authors['label'])

# Make predictions
train_pred = author_pipe.predict(train_authors['text'])
val_pred   = author_pipe.predict(val_authors['text'])
test_pred  = author_pipe.predict(test_authors['text'])

# Print results
for split, y_true, y_hat in [
    ("Train", train_authors['label'],  train_pred),
    ("Val  ", val_authors['label'],    val_pred), 
    ("Test ", test_authors['label'],   test_pred)
]:
    # Calculate metrics
    mse = mean_squared_error(y_true, y_hat)
    r2  = r2_score(y_true, y_hat)

    # BCE requires predictions in [0, 1]
    y_hat_clipped = np.clip(y_hat, 0, 1)
    bce = -np.mean(y_true * np.log(y_hat_clipped + 1e-10) +
                   (1 - y_true) * np.log(1 - y_hat_clipped + 1e-10))

    print(f"{split} | MSE {mse:.4f}  R² {r2:.4f}  BCE {bce:.4f}")



Train | MSE 0.0360  R² 0.8394  BCE 0.1677
Val   | MSE 0.0904  R² 0.5793  BCE 0.3181
Test  | MSE 0.0974  R² 0.5470  BCE 0.3507
