<center><h2>Stylometry with John Burrows’ Delta Method using Python</h2></center>

__Stylometry__ is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognizable and unique ways. For example:

* Each person has their own unique vocabulary, sometimes rich, sometimes limited. Although a larger vocabulary is usually associated with literary quality, this is not always the case.
* Some people write in short sentences, while others prefer long blocks of text consisting of many clauses.
* No two people use semicolons, em-dashes, and other forms of punctuation in the exact same way.

However, one of the most common applications of stylometry is in authorship attribution. Given an anonymous text, it is sometimes possible to guess who wrote it by measuring certain features, like the average number of words per sentence or the propensity of the author to use “while” instead of “whilst”, and comparing the measurements with other texts written by the suspected author. This is what we will be doing in this lesson. Here, we will apply __John Burrows’ Delta Method__ to infer authorship of an anonymous text or set of texts.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk import download

## Load Data
Here, we will conduct our experiment with a corpus that contains 1000 tweets and the number of unique user was 10.

In [2]:
data = pd.read_csv("data/TDS.csv")

In [3]:
data.head()

Unnamed: 0,user,profession,tweet
0,@TomCruise,Actor,Thank you to all the fans who came out to Hall...
1,@TomCruise,Actor,Thank you to everyone who supported Mission: I...
2,@TomCruise,Actor,"2,000 feet, 2,000 people, 4 hours of hiking. T..."
3,@TomCruise,Actor,It was an honor to be recognized at #CinemaCon...
4,@TomCruise,Actor,Thank you to the amazing people of New Zealand...


In [4]:
data.shape

(1000, 3)

In [5]:
data = data.drop_duplicates()

## Data Separation

In [6]:
X = data.tweet
y = data.user

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)

data_train = pd.DataFrame()
data_train["tweet"] = X_train
data_train["user"] = y_train

data_test = pd.DataFrame()
data_test["tweet"] = X_test
data_test["user"] = y_test

## Preprocessing

In [7]:
from nltk import word_tokenize, sent_tokenize
download('punkt')

def preprocess(text):
    tokens = word_tokenize(text)
    preprocessed_text = [token for token in tokens if any(c.isalpha() for c in token)]
    return preprocessed_text

data_train['tweet'] = data_train['tweet'].apply(preprocess)

data_train = pd.DataFrame(data_train['tweet'].groupby(data_train['user']).sum()).reset_index()
data_test = pd.DataFrame(data_test['tweet'].groupby(data_test['user']).sum()).reset_index()

[nltk_data] Downloading package punkt to /home/silicon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Feature Selection

In [8]:
whole_corpus = []
for i in data_train.tweet:
    whole_corpus += i

In [9]:
whole_corpus_freq_dist = list(nltk.FreqDist(whole_corpus).most_common(30))
#whole_corpus_freq_dist[ :30 ]

In [10]:
features = [word for word, freq in whole_corpus_freq_dist]

## Calculating features for each subcorpus

In [11]:
feature_freqs = {}

for i, row in data_train.iterrows():
    
    feature_freqs[row.user] = {}
    
    overall = len(row.tweet)
    
    for feature in features:
        presence = row.tweet.count(feature)
        feature_freqs[row.user][feature] = presence / overall

## Calculating feature averages and standard deviations

In [12]:
import math

# The data structure into which we will be storing the "corpus standard" statistics
corpus_features = {}

# For each feature...
for feature in features:
    # Create a sub-dictionary that will contain the feature's mean 
    # and standard deviation
    corpus_features[feature] = {}
    
    # Calculate the mean of the frequencies expressed in the subcorpora
    feature_average = 0
    for i, row in data_train.iterrows():
        feature_average += feature_freqs[row.user][feature]
    feature_average /= len(data_train.user)
    corpus_features[feature]["Mean"] = feature_average
    
    # Calculate the standard deviation using the basic formula for a sample
    feature_stdev = 0
    for i, row in data_train.iterrows():
        diff = feature_freqs[row.user][feature] - corpus_features[feature]["Mean"]
        feature_stdev += diff*diff
    feature_stdev /= (len(data_train.user) - 1)
    feature_stdev = math.sqrt(feature_stdev)
    corpus_features[feature]["StdDev"] = feature_stdev

__Equation for the z-score statistic:__
$$ Z_i = \frac{C_i - \mu_i}{\sigma_i}$$
This is the equation of z-score for feature $i$, where $C_i$ represents the observed frequency, the greek letter $\mu$ represents the mean of means, and the greek letter $\sigma$, the standard deviation.

__Equation for John Burrows’ Delta statistic:__
$$ \Delta_c = \sum_i \frac{\mid Z_c(i) - Z_t(i) \mid}{n} $$
where, $Z_c(i)$ is the z-score for feature $i$ in author $c$, and $Z_t(i)$ is the z-score for feature $i$ in the test case.

## Calculating z-scores

In [13]:
feature_zscores = {}

for i, row in data_train.iterrows():
    feature_zscores[row.user] = {}
    
    for feature in features:
        # Z-score definition = (value - mean) / stddev
        feature_val = feature_freqs[row.user][feature]
        feature_mean = corpus_features[feature]["Mean"]
        feature_stdev = corpus_features[feature]["StdDev"]
        feature_zscores[row.user][feature] = ((feature_val-feature_mean) / feature_stdev)

## Calculating features and z-scores for test case

In [14]:
def testing(test_case):
    # Tokenize the test case
    testcase_tokens = nltk.word_tokenize(test_case)

    # Filter out punctuation and lowercase the tokens
    testcase_tokens = [token.lower() for token in testcase_tokens if any(c.isalpha() for c in token)]
    

    # Calculate the test case's features
    overall = len(testcase_tokens)
    testcase_freqs = {}
    for feature in features:
        presence = testcase_tokens.count(feature)
        testcase_freqs[feature] = presence / overall

    # Calculate the test case's feature z-scores
    testcase_zscores = {}
    for feature in features:
        feature_val = testcase_freqs[feature]
        feature_mean = corpus_features[feature]["Mean"]
        feature_stdev = corpus_features[feature]["StdDev"]
        testcase_zscores[feature] = (feature_val - feature_mean) / feature_stdev

    ## Calculating Delta
    test_result = {}
    for i, row in data_train.iterrows():
        delta = 0
        for feature in features:
            delta += math.fabs((testcase_zscores[feature] - feature_zscores[row.user][feature]))
        delta /= len(features)
        test_result[row.user] = delta

    # min_val = sorted(list(test_result.values()))[:2]
    # author = [k for k, v in test_result.items() if v == min_val[0] or v == min_val[1]]
    
    min_val = min(test_result.values())
    author = [k for k, v in test_result.items() if v == min_val]
    return(author)

In [15]:
data_test.user = data_test.user.astype(str)
result = []
for test_case in data_test['tweet']:
    result.append(testing(test_case)[0])

data_test["result"] = result

## Evaluation

In [16]:
accuracy = (data_test[data_test.user == data_test.result].shape[0]) / (data.user.nunique())
print("accuracy: ", accuracy*100,'%')

accuracy:  90.0 %
