## Do I write like Shakespeare?
This notebook was written as an attempt to address the following problem. How do you know when what someone is writing is really from them, as opposed to copying someone else? $$ $$
One way to tell is to look at their word choice. Each person has a pattern of word choice and favorite expressions. My friend James likes to say “It beats a sharp stick in the eye!” when something is not too bad, my father-in-law: “Gimme a break!” when he thinks something is absurd. He also quotes Shakespeare. $$ $$
So, if we have two texts, say one from Shakespeare and another from a layperson, a sample of 20 words from what someone says should be enough to tell whether they are copying, or quoting, Shakespeare, or whether it is their own speech production.
In this notebook, I use a technique called “Bag Of Words”, in which a particular text, or a particular sample, is a vector of word counts. $$ $$
So the first step is to find out the set of unique words, the vocabulary, the two texts use, and how many such unique words, called tokens, there are. $$ $$
Doing so requires the use of a tokenizer program. I use the keras.preprocessing.text one. $$ $$
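
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not the Keras tokenizer: `build_vocabulary` and `bag_of_words` are hypothetical helpers written for illustration, and the two short texts are made up. It only assumes that tokens are numbered from 1, most frequent word first.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct word to a token number starting at 1, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def bag_of_words(text, vocabulary):
    """Return a list of word counts; position 0 is left unused."""
    vector = [0] * (len(vocabulary) + 1)
    for word in text.lower().split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

# Made-up example texts, not the notebook's corpus
texts = ["the belly answered the citizen", "the network has one layer"]
vocabulary = build_vocabulary(texts)
print(vocabulary)
print(bag_of_words("the citizen and the belly", vocabulary))
```

The word order of the sample is lost: only the counts per vocabulary word remain, which is what the classifier will see.
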
In this notebook I also use Tensorflow. Keras uses Tensorflow as backend. $$ $$
If you do not have Tensorflow, there is an easy way to get it for about 10 dollars, as long as you have an Android phone: an app called Pydroid 3. Sorry Apple users, it is not available in the Apple store. You can also install Tensorflow on your computer, but in my experience it tends to interfere with other programs, whereas the phone install gives you a binary that is compatible with other libraries you may want to install in the future. Many other Python libraries are available there as well. $$ $$
For most of you, I included the tokenized versions of the two texts as files loadable with numpy, so you do not have to worry about the keras tokenizer. The section with the neural network can just be skipped if you do not have Tensorflow. $$ $$
#### This notebook is available at: https://github.com/pzirnhel/NewbyRepository.git

In [190]:
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
import keras
import numpy as np
from sklearn.linear_model import LogisticRegression

In [191]:
tokenizer = Tokenizer()

data = open('tmp/CoriolanusAndOctavioShortNoPunctuationNoApostrophe.txt').read()

corpus = data.lower().split("\n")


tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

In [192]:
print(total_words)

734


`total_words` is 734: the two texts contain 733 different words, and one is added because Keras token numbers start at 1 (index 0 is reserved).

In [193]:
# word_index is a dictionary that maps each word to its token number (token numbers start at 1)
# This cell builds the reverse lookup, useful for printing: idx_to_word[i] is the word whose token number is i + 1.
word_index = tokenizer.word_index
idx_to_word=[]
for word in word_index:
    idx_to_word.append(word)
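
Because token numbers start at 1 while list positions start at 0, a reverse dictionary is one way to look a word up by its token number without an offset. The sketch below uses a made-up three-word `word_index`, and `index_word` is an illustrative name, not something the notebook defines.

```python
# Toy stand-in for tokenizer.word_index: tokens start at 1, most frequent word first
word_index = {"the": 1, "to": 2, "you": 3}

# Reverse lookup keyed by token number, so no "+ 1" or "- 1" is needed
index_word = {token: word for word, token in word_index.items()}

tokens = [3, 1, 2]
decoded = " ".join(index_word[t] for t in tokens)
print(decoded)
```
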

In the next cell, we make one long string of all the words that appear in the two texts, ranked by frequency, most frequent first.

In [194]:
all_words=""
for idx in range(len(idx_to_word)):
    all_words=all_words+idx_to_word[idx]+" "
print(all_words)

the to you it of a and citizen we in first for is with what not was that i he us all on this our one have be but s are they so would if as menenius she his or more did name units him these your belly speak know at an could do when image list representation t second them well say can must other tell like only then about each hear me let ll against why who will by answer long octavio her find want sutsbencuns ai input layer good poor were their give which even o now strong sir most up there off body look hornada just today how patel geoff professor line examples any resolved rather than people done patricians think too very hath mother cannot way need my where go had shall already care may make daily d see made from such chapter ocean perhaps left another last take going put two b get right use network given much its training goes before proceed further famish caius marcius no talking away citizens wholesome revenge gods has country content proud unto though soft please help here come ho

In [195]:
# This gives the number of lines in the two texts put together
lines=[]
for line in corpus:
    lines.append(line)
print(len(lines))

283


In [196]:
# This is the first line of "The Tragedy of Coriolanus"
print(lines[0])

first citizen


The purpose of the next cell is to identify the last line pertaining to Shakespeare in the two texts, in order to make two separate lists, one being the list of tokens of the Shakespeare text: “coriolanus_tokenized”, the other one my prose: “octavio_tokenized”.

In [197]:
for current_line_nb in range(190,201): #201 is the first line of the Octavio file
    current_line=lines[current_line_nb].lower().split(" ")
    print(len(current_line))
    for idx in range(len(current_line)):
        if current_line[idx] != '':  # compare by value; "is not" tests object identity
            print(word_index[current_line[idx]])
            print(current_line[idx])

1
1
37
menenius
4
19
i
80
will
67
tell
3
you
11
35
if
3
you
76
ll
487
bestow
6
a
219
small
5
of
15
what
3
you
27
have
211
little
9
488
patience
489
awhile
3
you
76
ll
73
hear
1
the
48
belly
30
s
82
answer
1
2
11
first
8
citizen
5
490
ye
491
re
83
long
71
about
4
it
1
1
37
menenius
1


In [198]:
coriolanus_tokenized=[]
for current_line_nb in range(201): #201 is the first line of the Octavio file
    current_line=lines[current_line_nb].lower().split(" ")
    #print(len(current_line))
    for idx in range(len(current_line)):
        if current_line[idx] != '':  # compare by value; "is not" tests object identity
            coriolanus_tokenized.append(word_index[current_line[idx]])

In [199]:
print(coriolanus_tokenized)

[11, 8, 168, 9, 169, 118, 170, 73, 74, 49, 22, 49, 49, 11, 8, 3, 31, 22, 119, 120, 2, 260, 121, 2, 171, 22, 119, 119, 11, 8, 11, 3, 50, 172, 173, 13, 261, 262, 2, 1, 122, 22, 9, 50, 59, 9, 50, 59, 11, 8, 75, 21, 263, 45, 7, 9, 76, 27, 264, 51, 25, 265, 266, 13, 59, 6, 267, 22, 174, 41, 175, 23, 59, 75, 4, 28, 123, 176, 176, 60, 8, 26, 268, 92, 177, 11, 8, 9, 31, 269, 93, 177, 1, 124, 92, 15, 270, 271, 23, 34, 272, 21, 35, 32, 34, 273, 21, 29, 1, 274, 275, 4, 94, 178, 9, 276, 277, 32, 278, 21, 279, 29, 32, 125, 9, 31, 126, 280, 1, 281, 18, 282, 21, 1, 283, 5, 25, 284, 13, 36, 52, 285, 2, 286, 95, 287, 25, 288, 13, 6, 289, 2, 61, 75, 21, 179, 24, 14, 25, 290, 291, 9, 292, 293, 12, 1, 180, 50, 19, 49, 24, 10, 294, 12, 295, 16, 10, 296, 12, 179, 60, 8, 34, 3, 169, 297, 77, 172, 173, 22, 77, 45, 11, 20, 30, 6, 127, 298, 2, 1, 299, 60, 8, 300, 3, 15, 301, 20, 181, 123, 12, 39, 182, 11, 8, 127, 62, 7, 53, 28, 183, 2, 96, 45, 92, 302, 303, 29, 18, 20, 304, 305, 14, 306, 184, 60, 8, 307, 29, 49

In [200]:
octavio_tokenized=[]
for current_line_nb in range(201,283): #201 is the first line of the Octavio file
    current_line=lines[current_line_nb].lower().split(" ")
    #print(len(current_line))
    for idx in range(len(current_line)):
        if current_line[idx] != '':  # compare by value; "is not" tests object identity
            octavio_tokenized.append(word_index[current_line[idx]])
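
The two loops above differ only in their line range, so they could be folded into one helper. This is a sketch under assumptions: `tokenize_lines` is a hypothetical function, and the lines, the `word_index` and the boundary are toy values rather than the notebook's data.

```python
def tokenize_lines(lines, word_index, start, stop):
    """Flatten lines[start:stop] into a single list of token numbers."""
    tokens = []
    for line in lines[start:stop]:
        # split() with no argument drops the empty strings that
        # split(" ") produces for blank lines or trailing spaces
        for word in line.lower().split():
            tokens.append(word_index[word])
    return tokens

# Toy data: three lines for the first text, one for the second
lines = ["first citizen", "you speak ", "", "the network"]
word_index = {"the": 1, "you": 2, "first": 3, "citizen": 4, "speak": 5, "network": 6}
boundary = 3  # hypothetical index of the first line of the second text

first_text = tokenize_lines(lines, word_index, 0, boundary)
second_text = tokenize_lines(lines, word_index, boundary, len(lines))
print(first_text, second_text)
```
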

In [201]:
print(octavio_tokenized)

[1, 492, 493, 5, 84, 109, 148, 26, 84, 30, 494, 84, 17, 495, 10, 496, 497, 39, 498, 129, 65, 27, 499, 126, 93, 2, 500, 501, 7, 2, 502, 45, 20, 17, 220, 10, 6, 221, 23, 6, 503, 51, 504, 505, 506, 1, 507, 5, 207, 1, 508, 5, 509, 10, 510, 18, 511, 1, 512, 513, 1, 514, 149, 150, 39, 129, 515, 85, 222, 34, 28, 223, 81, 6, 516, 40, 26, 5, 46, 205, 517, 518, 519, 520, 51, 1, 521, 5, 522, 38, 136, 110, 151, 1, 221, 14, 6, 523, 23, 4, 84, 109, 15, 6, 524, 43, 84, 150, 24, 17, 85, 525, 526, 7, 38, 53, 16, 86, 152, 41, 527, 528, 43, 40, 150, 38, 529, 38, 34, 27, 2, 75, 45, 135, 7, 78, 42, 38, 530, 6, 153, 43, 78, 110, 26, 153, 43, 17, 4, 39, 531, 30, 43, 7, 38, 34, 16, 96, 532, 6, 43, 13, 147, 52, 533, 534, 1, 535, 536, 40, 1, 224, 34, 537, 538, 539, 5, 122, 540, 109, 4, 34, 541, 225, 2, 85, 29, 4, 17, 16, 226, 1, 224, 42, 108, 1, 43, 104, 109, 13, 16, 215, 1, 542, 53, 16, 86, 79, 53, 27, 543, 1, 544, 24, 153, 43, 225, 545, 1, 546, 42, 154, 6, 83, 212, 186, 33, 1, 43, 547, 1, 227, 20, 17, 223, 1,

In [202]:
print(len(coriolanus_tokenized),len(octavio_tokenized))

1010 1145


The first text, the first five pages of "The Tragedy of Coriolanus" written by Shakespeare, is a list of 1010 tokens. $$ $$
The second, the first five pages of a would-be novel called "The Multiple Lives of Octavio Hornada" written by me, is a list of 1145 tokens. Close enough. $$ $$
In the next cell, I save these lists and load them back. All of it is commented out. If you do not have the keras tokenizer available, you can simply un-comment the two load instructions to get the lists.

In [203]:
#np.save('tmp/coriolanus_tokenized',np.array(coriolanus_tokenized))
#np.save('tmp/octavio_tokenized',np.array(octavio_tokenized))
#coriolanus_tokenized=list(np.load('tmp/coriolanus_tokenized.npy'))
#octavio_tokenized=list(np.load('tmp/octavio_tokenized.npy'))

In [204]:
# Each array is shuffled, in order to extract random samples from them
coriolanus_shuffled=np.array(coriolanus_tokenized)
np.random.shuffle(coriolanus_shuffled)
octavio_shuffled=np.array(octavio_tokenized)
np.random.shuffle(octavio_shuffled)

The next step is to extract 50 “lists” of 20 tokens from each numpy array: that will give us a total of 100 samples. $$ $$
They are not really lists, simply rows of two 50 by 20 numpy arrays. $$ $$
50 times 20 = 1000. We only extract 1000 tokens from the two texts and discard the remainder. Because we shuffled the tokenized texts, the samples are random. 
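
The same shuffle-then-cut step can be sketched without numpy. `make_samples` is a hypothetical helper, and the token list is a stand-in of the same length as the Shakespeare text, not the real tokens.

```python
import random

def make_samples(tokens, sample_size, n_samples, seed=None):
    """Shuffle a copy of the tokens, keep the first sample_size * n_samples,
    and cut them into n_samples rows of sample_size tokens each."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    kept = shuffled[: sample_size * n_samples]  # the remainder is discarded
    return [kept[i : i + sample_size] for i in range(0, len(kept), sample_size)]

tokens = list(range(1, 1011))  # stand-in for a text of 1010 tokens
samples = make_samples(tokens, sample_size=20, n_samples=50, seed=0)
print(len(samples), len(samples[0]))
```

Since the tokens are shuffled before being cut, each row is a random draw of 20 words from the text rather than 20 consecutive words.
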

In [205]:
input_dim=20
n_samples=2000//input_dim
X_coriolanus_idx=np.reshape(coriolanus_shuffled[:1000],[n_samples//2,input_dim])
X_octavio_idx=np.reshape(octavio_shuffled[:1000],[n_samples//2,input_dim])
print(n_samples)

100


Now it is time to convert the samples from lists of tokens to vectors of word counts. $$ $$
It turned out the most common words were a source of noise in this classification task. $$ $$
So the vectors do not include these features: if the sample contains one of these frequent words, we do not count it. With `freq_words` set to 7, token numbers 0 to 6 are dropped, that is the unused index 0 and the six most frequent words. This creates vectors with 727 dimensions instead of 734.
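
As a small worked example of this offset, the sketch below counts one toy sample with a vocabulary of 15 tokens and the notebook's `freq_words` value of 7. `count_vector` is a hypothetical helper, and the sample is made up.

```python
def count_vector(sample, total_words, freq_words):
    """Count tokens in a sample, skipping token numbers below freq_words."""
    vector = [0.0] * (total_words - freq_words)
    for token in sample:
        position = token - freq_words  # shift so the first kept token lands at 0
        if position >= 0:              # negative means a dropped frequent word
            vector[position] += 1.0
    return vector

sample = [1, 6, 7, 7, 12]  # tokens 1 and 6 are among the dropped frequent words
vector = count_vector(sample, total_words=15, freq_words=7)
print(vector)
```
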

In [206]:
freq_words=7
last_words=total_words-freq_words
X_coriolanus=np.zeros([n_samples//2,last_words])
X_octavio=np.zeros([n_samples//2,last_words])
for sample in range(n_samples//2):
    for j in range(input_dim):
        word_idx = X_coriolanus_idx[sample,j]-freq_words
        if word_idx>=0:
            X_coriolanus[sample,word_idx]+=1.0
        word_idx = X_octavio_idx[sample,j]-freq_words
        if word_idx>=0:
            X_octavio[sample,word_idx]+=1.0

In [207]:
# Print one example of a word count vector (print(X_coriolanus[0,:]) works the same way)
# The vector is very sparse: lots of zeros
print(X_octavio[0,:])

[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

The next six cells are about organizing the data into training and test sets along with their corresponding labels: 1 for Shakespeare, 0 for not Shakespeare. $$ $$
First, we get a random ordering of numbers from 0 to 99. We will use this ordering to shuffle the data the same way we shuffle the labels, so each sample gets its correct corresponding label. $$ $$
Second we get a numpy array of 50 ones and 50 zeros, which are the correct labels before shuffling. $$ $$
Third, we shuffle the data and the labels the same way. $$ $$
Fourth, we split the data into training and test sets: 90 training samples, 10 test samples. $$ $$
The fifth and sixth cells verify that the data have the proper shapes and include both positive and negative samples. 
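
The key point of these steps is that data and labels are reordered with one shared ordering. Here is a minimal sketch in plain Python, with a hypothetical `shuffle_and_split` helper and ten made-up samples whose names reveal their class, so the pairing can be checked after shuffling.

```python
import random

def shuffle_and_split(samples, labels, n_train, seed=None):
    """Shuffle samples and labels with one shared ordering, then split."""
    ordering = list(range(len(samples)))
    random.Random(seed).shuffle(ordering)
    # Indexing both lists with the same ordering keeps each sample with its label
    shuffled_samples = [samples[i] for i in ordering]
    shuffled_labels = [labels[i] for i in ordering]
    train = (shuffled_samples[:n_train], shuffled_labels[:n_train])
    test = (shuffled_samples[n_train:], shuffled_labels[n_train:])
    return train, test

# Toy data: 5 samples labelled 1 (Shakespeare) followed by 5 labelled 0
samples = ["shakespeare_%d" % i for i in range(5)] + ["octavio_%d" % i for i in range(5)]
labels = [1.0] * 5 + [0.0] * 5
(X_train, y_train), (X_test, y_test) = shuffle_and_split(samples, labels, n_train=9, seed=0)
print(len(X_train), len(X_test))
```
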

In [208]:
ordering=np.arange(n_samples)
np.random.shuffle(ordering)
print(ordering.dtype)

int32


In [209]:
y_base=np.concatenate([np.ones(n_samples//2),np.zeros(n_samples//2)])
print(y_base.shape)

(100,)


In [210]:
X=np.concatenate([X_coriolanus,X_octavio],axis=0)[ordering,:]
y=y_base[ordering]
print(X.shape,y.shape)

(100, 727) (100,)


In [211]:
X_train,X_test=X[:(9*n_samples)//10,:],X[(9*n_samples)//10:n_samples,:]
y_train,y_test=y[:(9*n_samples)//10],y[(9*n_samples)//10:n_samples]

In [212]:
print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

(90, 727) (10, 727)
(90,) (10,)


In [213]:
print(y_test)

[0. 0. 1. 1. 1. 0. 1. 0. 0. 1.]


#### Neural Network Implementation
The next three cells are about setting up and training a logistic regressor implemented as a neural network. If you do not have Tensorflow, you can just skip running these 3 cells. $$ $$
I decided to train a neural network because, initially, I did not know if a logistic regressor would suffice, and I wanted the flexibility to modify my model by adding layers to see if it could detect a signal, that is, a difference between the two distributions of word counts. It was not a given that the data would be linearly separable, although the sparseness of the feature vectors suggested it. $$ $$
Neural networks with several layers can handle non-linearly separable data.
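
For readers without Tensorflow, the idea of a logistic regressor (a weighted sum of word counts passed through a sigmoid) can be sketched in plain Python. This is an illustration only, not the Keras or sklearn implementation: it uses plain stochastic gradient descent instead of Adam, and the functions, hyperparameters and four toy vectors are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, learning_rate=0.5, epochs=200):
    """Fit weights and bias by stochastic gradient descent on cross-entropy."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # gradient of the loss with respect to z
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, features):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 if sigmoid(z) >= 0.5 else 0.0

# Toy word-count vectors: class 1 only uses words 0 and 1, class 0 only words 2 and 3
X = [[1, 1, 0, 0], [2, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 2]]
y = [1.0, 1.0, 0.0, 0.0]
weights, bias = train_logistic(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

When the two classes use disjoint words, as in this toy case, a single linear boundary separates them, which is the situation the sparse feature vectors hint at.
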

In [214]:
# Callbacks are the way to modify training without changing the model.
# on_epoch_end stops the training once a given training accuracy is reached, in order to prevent overfitting.
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # The metric key is 'acc' in older Keras versions and 'accuracy' in newer ones
        acc = logs.get('accuracy', logs.get('acc'))
        if acc is not None and acc > 0.99:
            print("\nReached 99% accuracy so cancelling training!")
            self.model.stop_training = True

callbacks = myCallback()

In [215]:
# Similarly to sklearn, in Keras, we specify a model: this one has one fully connected layer "Dense".
# The output of the Dense layer is fed to the sigmoid function of a single unit
model = keras.models.Sequential([keras.layers.Dense(1, activation=tf.nn.sigmoid)])
# Keras uses the Python line above to build a computation graph that the Tensorflow backend runs in compiled code rather than in the Python interpreter
# In building this graph, it also needs to know what optimization algorithm to use (Adam, for adaptive moment estimation),
# what loss function to minimize and, optionally, the metric we are interested in
model.compile(optimizer = 'adam',loss = 'binary_crossentropy',metrics=['accuracy'])
# This is the training phase:
# No need to feed it batches or to shuffle, and no need to split between training and
# validation data sets: keras does that for you (validation_split holds out the last fraction of the data); just set hyperparameters
# fit returns a history: an object containing a dictionary of the evolution of training:
# training loss, validation loss, training accuracy, validation accuracy

# Using the callback to stop training early is good to prevent the network from overfitting
#history=model.fit(X_train, y_train, batch_size=10, epochs=100,callbacks=[callbacks],validation_split=1.0/9.0,shuffle=True)
# Here the network is underfitting, so it is best to train longer.
history=model.fit(X_train, y_train, batch_size=10, epochs=50,validation_split=1.0/9.0,shuffle=True)
model.evaluate(X_test, y_test)

Train on 80 samples, validate on 10 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


[0.4022069573402405, 1.0]

In [216]:
print(y_train)

[0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0.
 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1.
 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1.
 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0.]


#### Sklearn implementation
This is the same type of logistic regressor, implemented with sklearn. Unlike the Keras model above, sklearn's LogisticRegression applies L2 regularization by default.

In [217]:
LogReg = LogisticRegression(class_weight='balanced',solver='liblinear').fit(X_train,y_train.ravel())

In [218]:
y_pred = LogReg.predict(X_test)
print("Accuracy:")
print(np.mean(np.float32(np.equal(y_pred,y_test))))

Accuracy:
1.0
