# Siamese Networks for Authorship Verification

Siamese networks were first used decades ago for signature verification. With the renewed interest in neural networks and deep learning, they are now being applied to many kinds of verification tasks. Instead of teaching the network to recognise a specific class by giving it many labeled examples of that class, we have it learn a distance function between pairs of examples. It learns to tell whether two examples are from the same class or from different classes.

For example, image re-identification is used for automatic access control. A single photo of Alice is stored as a reference or ID photo. When Alice wants to access the building, a new photo is taken at the door and compared to her reference photo. The door opens if the algorithm detects a match. With a simple classification neural net, we'd need to show the net many examples of Alice, in different poses, wearing different clothing, with her face at different angles. A Siamese network can recognise Alice even if it has never seen her before. Instead of learning what Alice looks like, it learns to tell whether two photos are of the same person by learning a distance function between pairs of photos.

We show here how a Siamese network can be used for authorship verification -- given two texts, it predicts whether or not they were written by the same author, even if it has never seen other texts by that author.

Most of the code for the Siamese Network comes from here: https://github.com/fchollet/keras/blob/master/examples/mnist_siamese_graph.py
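
To make the idea of a learned distance more concrete, here is a minimal sketch of the two ingredients a Siamese network is typically trained with: a distance between the two embedded examples and a contrastive loss on that distance. We assume here the contrastive loss formulation used in the Keras example linked above, with a margin of 1; the function names and the toy vectors are our own and purely illustrative.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(same_class, distance, margin=1.0):
    """Contrastive loss for a single pair (assumed formulation).

    Pairs from the same class are penalised for being far apart;
    pairs from different classes are penalised only if they are
    closer than `margin`.
    """
    if same_class:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2

# Toy embeddings (hypothetical values, not produced by a real network)
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d)                               # distance between the two embeddings
print(contrastive_loss(True, 0.5))     # same class, moderately far apart
print(contrastive_loss(False, 0.5))    # different class, inside the margin
print(contrastive_loss(False, 2.0))    # different class, beyond the margin
```

Minimising this loss pulls same-class pairs together and pushes different-class pairs at least `margin` apart, which is what lets the network compare examples from classes it has never seen.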

In [2]:
# __future__ imports must come before any other statement in the cell
from __future__ import absolute_import
from __future__ import print_function

import os
import random
from collections import Counter

import numpy as np
np.random.seed(1337)  # for reproducibility

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion

from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Input, Lambda
from keras.optimizers import RMSprop
from keras import backend as K

Using TensorFlow backend.


In [3]:
# paths to PAN datasets, available from http://pan.webis.de/clef15/pan15-web/author-identification.html
# and http://pan.webis.de/clef14/pan14-web/author-identification.html
pan15train = "/data/panstuffs/pan15-authorship-verification-training-dataset-english-2015-04-19/"
pan15test = "/data/panstuffs/pan15-authorship-verification-test-dataset2-english-2015-04-19/"
pan14train = "/data/panstuffs/pan14-author-verification-training-corpus-english-novels-2014-04-22/"
pan14test = "/data/panstuffs/pan14-author-verification-test-corpus2-english-novels-2014-04-22/"
pan14train_e = "/data/panstuffs/pan14-author-verification-training-corpus-english-essays-2014-04-22/"
pan14test_e = "/data/panstuffs/pan14-author-verification-test-corpus2-english-essays-2014-04-22/"

In [4]:
def read_file(filepath):
    with open(filepath) as f:
        s = f.read()
    return s

def load_pan_data(directory, prefix="E"):
    """Load known and unknown texts in the PAN data format"""
    # FIXME: assumes one known file per author, which is fine for English datasets only
    authors = sorted([x for x in os.listdir(directory) if x.startswith(prefix)])
    known_texts = []
    unknown_texts = []
    for author in authors:
        kf = os.path.join(directory, author, "known01.txt")
        uf = os.path.join(directory, author, "unknown.txt")
        known_texts.append(read_file(kf))
        unknown_texts.append(read_file(uf))
        
    truthfile = os.path.join(directory, "truth.txt")
    with open(truthfile) as f:
        lines = f.read().strip().split("\n")
    y = [1 if line.split()[1] == "Y" else 0 for line in lines]
    y = np.array(y)
    return known_texts, unknown_texts, y
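
The loader above expects `truth.txt` to contain one line per problem, with the problem identifier followed by `Y` (same author) or `N` (different author). The sketch below isolates that parsing step so it can be checked without the dataset on disk. The helper name `parse_truth` and the sample identifiers are hypothetical and only serve as illustration.

```python
def parse_truth(text):
    """Turn the contents of a truth file into a list of 0/1 labels.

    Assumes each non-empty line looks like "<problem-id> <Y|N>".
    """
    labels = []
    for line in text.strip().split("\n"):
        problem_id, answer = line.split()[:2]
        labels.append(1 if answer == "Y" else 0)
    return labels

# Hypothetical sample in the assumed format
sample = "EN001 Y\nEN002 N\nEN003 Y\n"
labels = parse_truth(sample)
print(labels)
```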

In [5]:
tr_known, tr_unknown, tr_labels = load_pan_data(pan15train)
te_known, te_unknown, te_labels = load_pan_data(pan15test)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC
from sklearn.pipeline import FeatureUnion

In [33]:
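from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Baseline classifiers that were tried. Each assignment overwrites the previous one,
# so only the last (Multinomial Naive Bayes) is used below, despite the name `svm`.
svm = SVC(kernel='linear', probability=True)
svm = OneVsRestClassifier(SVC(kernel='linear', probability=True, C=5))
svm = MultinomialNB(alpha=0.0001)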

In [34]:
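# Swap the two splits: from here on, tr_* holds the original test set
# and te_* holds the original training set.
tr_known, te_known = te_known, tr_known
tr_labels, te_labels = te_labels, tr_labels
tr_unknown, te_unknown = te_unknown, tr_unknown
print(len(tr_known), len(tr_labels), len(tr_unknown))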

500 500 500


In [35]:
cvec = TfidfVectorizer(analyzer='char', ngram_range=(2,5), min_df=0.1, lowercase=False, binary=True, sublinear_tf=True, use_idf=True)
wvec = TfidfVectorizer(ngram_range=(1,3), min_df=0.01, lowercase=False, binary=True, sublinear_tf=True, use_idf=True)
vec = FeatureUnion([
    ('word', wvec),
    ('char', cvec)
])
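# After the swap in the earlier cell, te_* refers to the original training split
vec.fit(te_known + te_unknown)

trk_vecs = vec.transform(te_known)    # known texts
tru_vecs = vec.transform(te_unknown)  # unknown texts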
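
The `cvec` vectorizer above works on character n-grams of length 2 to 5. As a rough illustration of what those features are, the following sketch lists the raw character n-grams of a short string. It is a simplification under our own assumptions, not scikit-learn's implementation, which additionally applies document-frequency filtering (`min_df`) and TF-IDF weighting; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Return all character n-grams of `text` with n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

grams = char_ngrams("abcd", n_min=2, n_max=3)
print(grams)
```

Character n-grams of this kind capture spelling, punctuation, and spacing habits, which is why they are a common choice of feature for authorship tasks.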

In [36]:
# Treat every known text as its own class: problem i gets label i
svm.fit(trk_vecs, list(range(len(te_known))))

MultinomialNB(alpha=0.0001, class_prior=None, fit_prior=True)

In [37]:
preds = svm.predict_proba(tru_vecs)

In [38]:
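def in_top_half(predictions, index):
    """Return True if class `index` is in the top half of classes, ranked by probability, for sample `index`."""
    ranked = sorted(((v, i) for i, v in enumerate(predictions[index])), reverse=True)
    top_classes = [i for v, i in ranked[:len(predictions[0]) // 2]]
    return index in top_classes

bin_preds = [1 if in_top_half(preds, i) else 0 for i in range(len(preds))]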

In [39]:
from sklearn.metrics import accuracy_score
accuracy_score(te_labels, bin_preds)

0.60999999999999999
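
The baseline above trains one class per known text and answers "same author" for problem i when class i lands in the top half of the predicted probabilities for unknown text i. A small self-contained sketch of that decision rule on a toy probability matrix may make it easier to follow. The function name `top_half_decisions` and the numbers are hypothetical and chosen only for illustration.

```python
def top_half_decisions(probabilities):
    """Apply the top-half rule to a square matrix of class probabilities.

    Row i holds the predicted probabilities for unknown text i;
    the decision is 1 if class i ranks in the top half of that row.
    """
    n_classes = len(probabilities[0])
    keep = n_classes // 2
    decisions = []
    for index, row in enumerate(probabilities):
        ranked = sorted(range(n_classes), key=lambda c: row[c], reverse=True)
        decisions.append(1 if index in ranked[:keep] else 0)
    return decisions

# Toy probabilities (made up for the example)
toy = [
    [0.70, 0.15, 0.10, 0.05],  # class 0 ranks first  -> same author
    [0.40, 0.30, 0.20, 0.10],  # class 1 ranks second -> same author
    [0.40, 0.30, 0.10, 0.20],  # class 2 ranks last   -> different author
    [0.10, 0.20, 0.30, 0.40],  # class 3 ranks first  -> same author
]
decisions = top_half_decisions(toy)
print(decisions)
```

Because the rule always keeps half of the classes, it is a coarse threshold rather than a calibrated decision, which is worth keeping in mind when reading the accuracy above.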