In [1]:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

from fasterstylometry import Corpus, BurrowsDelta

## Loading Books

We can load some sample books from Project Gutenberg. The reference (training) set holds three works by Jane Austen and three by William Shakespeare; two of the Shakespeare texts are different editions of the same play, Hamlet. The test set holds one work each by Jane Austen, William Shakespeare and Charles Dickens.

Depending on how Project Gutenberg's mirrors are faring when you download the books, you may have to select a different [mirror from their list](https://www.gutenberg.org/MIRRORS.ALL).

In [2]:
# Change mirror here if necessary
MIRROR = 'http://gutenberg.pglaf.org/'

# Load in the training books
train_ids = [1342, 161, 158, 1513, 27761, 1524]
train_data = {
    'authors': ['Jane Austen', 'Jane Austen', 'Jane Austen', 'William Shakespeare', 'William Shakespeare', 'William Shakespeare'],
    'titles': ['Pride and Prejudice', 'Sense and Sensibility', 'Emma', 'Romeo and Juliet', 'Hamlet', 'Hamlet'],
    'texts': [strip_headers(load_etext(book_id, mirror=MIRROR)).strip() for book_id in train_ids]
}

# Load in the testing books
test_ids = [105, 23042, 1400]
test_data = {
    'authors': ['Jane Austen', 'William Shakespeare', 'Charles Dickens'],
    'titles': ['Persuasion', 'The Temptest', 'Great Expectations'],
    'texts': [strip_headers(load_etext(book_id, mirror=MIRROR)).strip() for book_id in test_ids]
}

## Create Corpora and Initialise Burrows Delta

Create the train and test corpora, making sure to pass the `top_k_tokens` from the train corpus to the test corpus when initialising it, so that both corpora use the same set of tokens.
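
To see why the token list has to be shared, note that Delta compares frequencies token by token, so every document needs a value for the same tokens in the same order. The following is a minimal sketch in plain Python, not fasterstylometry's implementation: `top_k_vocabulary` and `relative_frequencies` are hypothetical helpers, the toy sentences are invented, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter


def top_k_vocabulary(texts, k):
    """Return the k most frequent tokens across all texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [token for token, _ in counts.most_common(k)]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary token within a single text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for an empty text
    return [counts[token] / total for token in vocabulary]


# Toy data standing in for the train and test books
train_texts = ["the cat sat on the mat", "the dog and the cat"]
test_text = "a bird sat on the wall"

# The vocabulary is derived from the train texts only...
vocabulary = top_k_vocabulary(train_texts, k=2)

# ...and reused for the test text, so the vectors line up position by position
train_vectors = [relative_frequencies(t, vocabulary) for t in train_texts]
test_vector = relative_frequencies(test_text, vocabulary)
```

If the test text chose its own most frequent tokens instead, position 0 of `test_vector` could refer to a different word than position 0 of the train vectors, and the comparison would be meaningless.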

The `BurrowsDelta` object is then instantiated with these two corpora as parameters.

Instantiating a corpus does not trigger any calculations. The underlying data structure for the corpus is a LazyFrame, which is only evaluated once a statistic is required.
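
The deferred behaviour can be pictured with a small standard-library analogy. This is only a sketch of the general idea of computing on first access; it is not how a LazyFrame works internally, and `LazyWordCount` is a hypothetical class made up for illustration.

```python
from functools import cached_property


class LazyWordCount:
    """Toy stand-in for a lazily evaluated statistic (hypothetical class)."""

    def __init__(self, text):
        # Construction only stores the input; no work happens here
        self.text = text
        self.evaluations = 0

    @cached_property
    def word_count(self):
        # Runs on first access only; later accesses reuse the cached result
        self.evaluations += 1
        return len(self.text.split())


doc = LazyWordCount("it is a truth universally acknowledged")
evaluations_before = doc.evaluations  # nothing has been computed yet
count = doc.word_count                # first access triggers the computation
doc.word_count                        # second access reuses the cached value
evaluations_after = doc.evaluations
```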

In [3]:
# Create train and test corpora
train_corpus = Corpus(**train_data)
test_corpus = Corpus(**test_data, top_k_tokens=train_corpus.top_k_tokens)

# Instantiate BurrowsDelta object
burrows_delta = BurrowsDelta(train_corpus, test_corpus)

## Generating Delta-Based Metrics

A few statistics can be calculated from the `BurrowsDelta` object. The `document_deltas` property returns the full set of deltas for every pairing of a train document with a test document.

The other statistic is `author_deltas`, which calculates the Burrows' Delta between each test document and each unique author in the train set. This is useful for authorship attribution.

Both properties cause the LazyFrames to collect. Until one of them is accessed, no calculations are performed.

`NOTE:` `document_deltas` requires much more memory and computation than `author_deltas`.
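
For orientation, the classic formulation of Burrows' Delta is the mean absolute difference between the z-scored token frequencies of two documents. The sketch below follows that textbook definition in plain Python. It is an assumption that fasterstylometry matches it in every detail (for example, population versus sample standard deviation), and the frequency vectors are invented toy values.

```python
from statistics import mean, pstdev


def z_scores(vector, means, stdevs):
    """Standardise one frequency vector; constant columns map to 0."""
    return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(vector, means, stdevs)]


def burrows_delta(train_vectors, test_vector):
    """Delta between one test vector and each train vector (textbook form)."""
    # Column-wise mean and standard deviation over the train documents
    columns = list(zip(*train_vectors))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # population std: an assumption

    test_z = z_scores(test_vector, means, stdevs)
    deltas = []
    for vector in train_vectors:
        train_z = z_scores(vector, means, stdevs)
        # Mean absolute difference of z-scores across the shared tokens
        deltas.append(mean([abs(a - b) for a, b in zip(train_z, test_z)]))
    return deltas


# Toy relative frequencies for two tokens in three train documents
train_vectors = [[0.06, 0.02], [0.04, 0.03], [0.05, 0.01]]
# A test document identical in style to the first train document
test_vector = [0.06, 0.02]

deltas = burrows_delta(train_vectors, test_vector)
```

A smaller delta means the two documents use the shared tokens in more similar proportions, so here the first entry of `deltas` is the smallest.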

In [4]:
burrows_delta.document_deltas

| index | author | title | index_test | author_test | title_test | burrows_delta |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | f64 |
| "59cdb89e-27a7-498f-bfc3-bdeb7c…" | "William Shakespeare" | "Hamlet" | "fa0170c2-7457-47a5-bfe2-968185…" | "William Shakespeare" | "The Temptest" | 0.37244 |
| "ab6596d7-cc5c-4be6-9484-40adff…" | "William Shakespeare" | "Romeo and Juliet" | "fa0170c2-7457-47a5-bfe2-968185…" | "William Shakespeare" | "The Temptest" | 0.376282 |
| "fdb42b9d-cd9c-48d3-bf43-0a9e87…" | "William Shakespeare" | "Hamlet" | "fa0170c2-7457-47a5-bfe2-968185…" | "William Shakespeare" | "The Temptest" | 0.403112 |
| "96b5c287-35d4-414e-b765-ed348c…" | "Jane Austen" | "Pride and Prejudice" | "76eea703-047b-4b73-97de-447456…" | "Charles Dickens" | "Great Expectations" | 0.602278 |
| "da0da011-f2f3-4473-a06b-c029bf…" | "Jane Austen" | "Sense and Sensibility" | "76eea703-047b-4b73-97de-447456…" | "Charles Dickens" | "Great Expectations" | 0.65148 |
| … | … | … | … | … | … | … |
| "96b5c287-35d4-414e-b765-ed348c…" | "Jane Austen" | "Pride and Prejudice" | "fa0170c2-7457-47a5-bfe2-968185…" | "William Shakespeare" | "The Temptest" | 1.397659 |
| "fdb42b9d-cd9c-48d3-bf43-0a9e87…" | "William Shakespeare" | "Hamlet" | "76eea703-047b-4b73-97de-447456…" | "Charles Dickens" | "Great Expectations" | 1.456683 |
| "ab6596d7-cc5c-4be6-9484-40adff…" | "William Shakespeare" | "Romeo and Juliet" | "76eea703-047b-4b73-97de-447456…" | "Charles Dickens" | "Great Expectations" | 1.543043 |
| "59cdb89e-27a7-498f-bfc3-bdeb7c…" | "William Shakespeare" | "Hamlet" | "76eea703-047b-4b73-97de-447456…" | "Charles Dickens" | "Great Expectations" | 1.558848 |


In [5]:
burrows_delta.author_deltas

| author_train | {"76eea703-047b-4b73-97de-44745612a4ea","Great Expectations","Charles Dickens"} | {"b244396a-5cac-4425-ab31-2c87f5ad5ec0","Persuasion","Jane Austen"} | {"fa0170c2-7457-47a5-bfe2-968185e98aa8","The Temptest","William Shakespeare"} |
| --- | --- | --- | --- |
| str | f64 | f64 | f64 |
| "Jane Austen" | 0.396946 | 0.760167 | 1.655521 |
| "William Shakespeare" | 1.573442 | 0.935512 | 0.223838 |
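
One common way to turn these values into an attribution is to pick, for each test document, the train author with the smallest delta. The sketch below applies that nearest-author rule to the figures copied from the output above. The dictionary layout is hand-built for illustration and is not the structure `author_deltas` returns.

```python
# Values copied from the author_deltas output: {test title: {train author: delta}}
author_deltas = {
    "Great Expectations": {"Jane Austen": 0.396946, "William Shakespeare": 1.573442},
    "Persuasion": {"Jane Austen": 0.760167, "William Shakespeare": 0.935512},
    "The Temptest": {"Jane Austen": 1.655521, "William Shakespeare": 0.223838},
}

# Nearest-author rule: the smallest delta wins
predictions = {
    title: min(deltas, key=deltas.get)
    for title, deltas in author_deltas.items()
}

for title, author in predictions.items():
    print(f"{title}: {author}")
```

The rule can only ever return an author from the train set, so Great Expectations is assigned to Jane Austen even though Charles Dickens wrote it. Attribution of this kind is only meaningful when the true author is among the candidates.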


## Comparison with FastStylometry

In [17]:
from faststylometry import Corpus as fs_Corpus
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta

# Load and tokenise training corpus
fs_train_corpus = fs_Corpus()
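# Unpack in the same order as the dictionary fields: author, title, text
for author, title, text in zip(train_data['authors'], train_data['titles'], train_data['texts']):
    fs_train_corpus.add_book(author, title, text)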

fs_train_corpus.tokenise(tokenise_remove_pronouns_en)

# Load and tokenise test corpus
fs_test_corpus = fs_Corpus()
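# Unpack in the same order as the dictionary fields: author, title, text
for author, title, text in zip(test_data['authors'], test_data['titles'], test_data['texts']):
    fs_test_corpus.add_book(author, title, text)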

fs_test_corpus.tokenise(tokenise_remove_pronouns_en)

# Calculate Burrows' Delta - equivalent to author_deltas
calculate_burrows_delta(fs_train_corpus, fs_test_corpus)

|  | Charles Dickens - Great Expectations | Jane Austen - Persuasion | William Shakespeare - The Temptest |
| --- | --- | --- | --- |
| Jane Austen | 3.146572 | 1.204202 | 2.765183 |
| William Shakespeare | 3.300639 | 2.16015 | 1.81896 |
