# Multi-Author Writing Style Analysis
by: Noah Syrkis


As a writer, I have always been interested in the idea of a "writing style".
I have always wondered whether a writer's style can be quantified and, if so, how one would go about it.
This project is an attempt to build a model that does precisely that.

In [1]:
from sentence_transformers import SentenceTransformer
from collections import Counter
import numpy as np
import random
import pickle
import torch
from tqdm import tqdm
from functools import partial
from sklearn.metrics import f1_score
from src.utils import get_data

## data

In [2]:
# The embeddings were computed once with the commented-out lines below and cached to disk.
# model = SentenceTransformer('all-MiniLM-L6-v2')
# data = { str(i): get_data(i, model.encode) for i in range(1, 4) }
# pickle.dump(data, open('data/data.pkl', 'wb'))
with open('data/data.pkl', 'rb') as f:
    data = pickle.load(f)


In [3]:
def paired_samples(data_split):
    """Turns a data split into pairs of consecutive paragraphs (flattens multi-paragraph samples into pairs)."""
    pairs = []
    for problem_id in data_split.keys():
        texts = data_split[problem_id]['text']
        targets = data_split[problem_id]['truth']['changes']
        if len(texts) - 1 != len(targets):
            # TODO: fix. a few of the samples have more than one paragraph, making .readlines() wrong
            # print(f'problem {problem_id} has {len(texts)} texts and {len(targets)} targets')
            continue
        for target, text1, text2 in zip(targets, texts[:-1], texts[1:]):
            pairs.append((text1, text2, target))
    random.shuffle(pairs)
    return pairs
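
To make the pairing concrete, here is a minimal sketch on a toy split. The `toy_split` dictionary and its strings are hypothetical; they only mirror the `'text'` and `'truth'`/`'changes'` keys that `paired_samples` reads, and stand in for whatever `get_data` actually stores. The shuffle is left out so the result is deterministic.

```python
# Hypothetical toy split; the key layout mirrors what paired_samples reads.
toy_split = {
    "problem-1": {
        "text": ["para A", "para B", "para C"],
        "truth": {"changes": [0, 1]},
    },
}

pairs = []
for problem in toy_split.values():
    texts = problem["text"]
    changes = problem["truth"]["changes"]
    # n paragraphs give n - 1 consecutive pairs, one change label per pair
    assert len(texts) - 1 == len(changes)
    pairs.extend(zip(texts[:-1], texts[1:], changes))

print(pairs)
# [('para A', 'para B', 0), ('para B', 'para C', 1)]
```

Each triple holds two neighbouring paragraphs and a flag for whether the style changes between them.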

## exploration

We explore the data using the paired approach and find that
dataset 1 is heavily unbalanced (about 88% of its training pairs have label 1),
while datasets 2 and 3 are roughly balanced (about 53% of their training pairs have label 0):

In [4]:
for dataset_id in range(1, 4):
    print(f'dataset {dataset_id}')
    train_pairs = paired_samples(data[str(dataset_id)]['train'])
    valid_pairs = paired_samples(data[str(dataset_id)]['validation'])
    print(f'train: {Counter([p[2] for p in train_pairs])}')
    print(f'valid: {Counter([p[2] for p in valid_pairs])}')
    print()

dataset 1
train: Counter({1: 11340, 0: 1554})
valid: Counter({1: 2449, 0: 377})

dataset 2
train: Counter({0: 15001, 1: 13215})
valid: Counter({0: 3994, 1: 3019})

dataset 3
train: Counter({0: 10087, 1: 9017})
valid: Counter({0: 2159, 1: 1953})



## models

In [27]:
def get_batches(pairs, batch_size):
    """Yields (x1, x2, y) batches forever, reshuffling the pairs at the start of every epoch."""
    while True:
        perm = np.random.permutation(len(pairs))
        for i in range(0, len(pairs), batch_size):
            batch = [pairs[j] for j in perm[i:i+batch_size]]
            x1, x2, y = zip(*batch)
            yield x1, x2, y
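
Because the generator never terminates, a consumer has to decide how many batches to draw. The sketch below shows one way to do that with `itertools.islice`. It uses a stdlib-only stand-in (`get_batches_sketch`, a hypothetical name) that swaps `np.random.permutation` for `random.sample` so it runs without NumPy; the toy pairs are made up for illustration.

```python
import random
from itertools import islice


def get_batches_sketch(pairs, batch_size):
    """Stdlib-only stand-in for get_batches: reshuffle each epoch, yield forever."""
    while True:
        perm = random.sample(range(len(pairs)), len(pairs))
        for i in range(0, len(pairs), batch_size):
            batch = [pairs[j] for j in perm[i:i + batch_size]]
            x1, x2, y = zip(*batch)
            yield x1, x2, y


# Five hypothetical pairs with alternating labels.
toy_pairs = [(f"a{i}", f"b{i}", i % 2) for i in range(5)]

# Three batches of size 2 cover exactly one epoch of five pairs.
batches = list(islice(get_batches_sketch(toy_pairs, batch_size=2), 3))
sizes = [len(y) for _, _, y in batches]
print(sizes)  # [2, 2, 1]
```

Note that the last batch of an epoch is smaller whenever the number of pairs is not a multiple of `batch_size`.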

baseline f1 score on train: 0.6379281214549491
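
The cell that produced this score is not shown. The value is consistent with a baseline that always predicts label 1 on the training pairs of dataset 2, scored with binary F1. That reading is an inference from the label counts printed in the exploration section, not something the notebook confirms. A minimal sketch under that assumption:

```python
# Label counts for dataset 2's training pairs, copied from the exploration output.
n_pos, n_neg = 13215, 15001

# Assumed baseline: always predict 1. Every positive becomes a true positive,
# every negative a false positive, and there are no false negatives.
tp, fp, fn = n_pos, n_neg, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6379
```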
