[View in Colaboratory](https://colab.research.google.com/github/tornermarton/deep_learning_project/blob/master/dl_project_loremimpsum.ipynb)

# Authorship identification using deep learning
**Füleki Fábián, Jani Balázs Gábor, Torner Márton**  
*Project work for BME Deep Learning course (VITMAV45),  
Team: LoremIpsum*


**Milestone I**

**Dataset:**  
Our primary dataset is Reuters_50_50 (C50), a subset of the Reuters Corpus Volume I (RCV1). RCV1 is an archive of categorized newswire stories made public for research purposes by Reuters, Ltd. The C50 collection consists of 50 texts for each of the 50 top authors for training and, separately, the same amount for testing (5000 texts in total). This dataset has been used in previous studies of authorship recognition and can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip
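
The directory layout described above (one directory per author, each holding that author's texts) can be checked after the download with a short sketch like the one below. The helper name `count_texts_per_author` and the toy corpus built in a temporary directory are assumptions made for illustration. Pointed at the merged `C50` directory, it would be expected to report 100 texts per author, provided no file names collide between the training and test parts.

```python
import os
import tempfile

def count_texts_per_author(root):
    """Return {author_directory_name: number_of_files} for a corpus root."""
    counts = {}
    for author in sorted(os.listdir(root)):
        author_dir = os.path.join(root, author)
        if os.path.isdir(author_dir):
            counts[author] = len(os.listdir(author_dir))
    return counts

# Toy corpus standing in for the real C50 directory (2 authors, 3 texts each).
demo_root = tempfile.mkdtemp()
for author in ("AuthorA", "AuthorB"):
    os.makedirs(os.path.join(demo_root, author))
    for k in range(3):
        with open(os.path.join(demo_root, author, "text%d.txt" % k), "w") as f:
            f.write("placeholder")

demo_counts = count_texts_per_author(demo_root)
print(demo_counts)
```

On the real data the call would be `count_texts_per_author("C50")`, run after the download cell.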



In [1]:
# Clean storage for new files (-f: no error if nothing is there yet)
!rm -rf C50*

# Download the Reuters_50_50 (C50) dataset
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip"
!unzip -q C50.zip

# The download is split into 2 directories; merge them (we will do custom splitting)
!mkdir C50
!mv C50train/* C50/
!rsync -a C50test/ C50/

# Clean files we don't need
!rm C50.zip
!rm -r C50train
!rm -r C50test

--2018-10-14 21:33:48--  https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8194031 (7.8M) [application/zip]
Saving to: ‘C50.zip’


2018-10-14 21:33:50 (8.15 MB/s) - ‘C50.zip’ saved [8194031/8194031]



In [2]:
# Download and install the largest language pack for SpaCy
# It contains 1 000 000 word vectors (so only very rare words can't be processed)
!python -m spacy download en_core_web_lg


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')



In [3]:
# Get required resources
import spacy
import math
import time
import pandas as pd
import nltk
import os
import numpy as np
from nltk import tokenize

nltk.download('punkt')
pd.set_option("display.max_columns", None)
nlp = spacy.load('en_core_web_lg', disable=['ner','parser'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
def read_sentences_from_file(author, author_id, filename):
  sentences = []

  # read the whole article
  with open(os.path.join("C50", author, filename), 'r') as file:
    data = file.read()

  # split article into sentences
  for sentence in tokenize.sent_tokenize(data):
    # if a sentence is very short (happens e.g. after a quote 'he said.') we leave it out
    if len(tokenize.word_tokenize(sentence)) > 7:
      sentences.append([author_id, sentence])

  return sentences

In [0]:
# array which contains the authors' names (an author's id is its index in this list)
authors = []

def load_raw_sentences():
  raw_sentences = []
  # refill the module-level list in place; rebinding it here would leave the global list empty
  authors.clear()
  # read every file (article) in the root directory; the subdirectories are the authors' names
  for root, dirs, files in os.walk("C50"):
    for author_dir in dirs:
      authors.append(author_dir)
      for filename in os.listdir(os.path.join("C50", author_dir)):
        raw_sentences.extend(read_sentences_from_file(author_dir, len(authors)-1, filename))

  return raw_sentences

**Sentence parsing**

We parse all the sentences with SpaCy in the following way:

1. Tokenize the sentence (split it into words; in SpaCy punctuation characters also count as tokens, but we remove them later because they do not carry information we need).

2. Get the vector form of each word. Words that are not in the model's vocabulary (very rare words) are left out, because we can only use vectors as inputs.

3. Determine the part of speech of each word (part-of-speech tags, i.e. syntactic information).
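
The filtering part of these steps can be illustrated without loading the SpaCy model. In the sketch below, hand-written `(text, vector, pos)` tuples stand in for SpaCy tokens. The `ToyToken` type, the `keep_token` helper and the 3-dimensional vectors are assumptions made for illustration and are not part of the project code.

```python
from collections import namedtuple
import numpy as np

# Hypothetical stand-in for a SpaCy token: text, word vector (or None), POS tag.
ToyToken = namedtuple("ToyToken", ["text", "vector", "pos"])

def keep_token(token):
    """Keep a token only if it is not punctuation and has a word vector."""
    return token.pos != "PUNCT" and token.vector is not None

toy_sentence = [
    ToyToken("Cocoa", np.array([0.1, 0.2, 0.3]), "NOUN"),
    ToyToken("prices", np.array([0.0, -0.1, 0.4]), "NOUN"),
    ToyToken("rose", np.array([0.5, 0.1, -0.2]), "VERB"),
    ToyToken("xqzzt", None, "NOUN"),                    # rare word without a vector
    ToyToken(".", np.array([0.0, 0.0, 0.0]), "PUNCT"),  # punctuation token
]

kept = [t for t in toy_sentence if keep_token(t)]
kept_texts = [t.text for t in kept]
kept_tags = [t.pos for t in kept]
print(kept_texts, kept_tags)
```

Only the three ordinary words survive the filter, each still carrying its vector and its part-of-speech tag.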


In [0]:
def parse_raw_sentences(sentences, verbose=False):
  # just for writing out fancy things
  if verbose:
    start_time = time.time()
    
  parsed_sentences = np.empty([len(sentences)], dtype=[('author', object, 1), ('original', object, 1), ('parsed', object, 1)])
  
  # parse every sentence (word splitting -> tokens, determine part-of-speech tags for every word)
  for i in range(0, len(sentences)):
    author = sentences[i][0]
    raw_sentence = sentences[i][1]
    parsed = np.array([], dtype=[('text', object, 1), ('vector', object, 1), ('pos_str', object, 1), ('pos_num', object, 1)])
    
    doc = nlp(raw_sentence)
   
    for token in doc:
      # filter out stop words (not relevant/useful)
      # filter out punctuation; we compare the tag name because the numeric id (96 here) depends on the SpaCy version
      # if a word does not have vector form filter it out (very, very rare case)
      if not token.is_stop and token.pos_ != 'PUNCT' and token.has_vector:
        parsed = np.append(parsed, np.array((token.text, token.vector, token.pos_, token.pos), dtype=[('text', object, 1), ('vector', object, 1), ('pos_str', object, 1), ('pos_num', object, 1)]))
    
    parsed_sentences[i] = (author, raw_sentence, parsed)
    
    if verbose and (i+1)%1000 == 0:
      print(str(i+1)+" sentences parsed in " +str(round(time.time() - start_time))+ " seconds.")
  
  if verbose:
    print(str(len(sentences)) + " sentences successfully parsed.")
    
  
  return parsed_sentences

In [0]:
def count_avg_sentence_len(sentences):
  # average number of parsed words per sentence
  total = 0
  count = 0
  for sentence in sentences['parsed']:
    total += len(sentence)
    count += 1
  
  return total/count

**Equalization of the sentences in the dataset**

We plan to use sentence-based identification, so our system needs sentences of equal length (word count). The articles are obviously not written this way, so we have to equalize them. We calculate the average word count of the sentences in the dataset and then transform all of them to contain the same number of words (we round the average up to keep more sentences in their full form).

Sentences that are too short are extended with wildcard (magic) words, which are meant to be filtered out during the learning process.

Sentences that are too long are simply cut to length.
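
As a small illustration of this rule, the sketch below applies it to plain Python lists of words. The helper `equalize` and the `PAD` marker are hypothetical names used only here. The project cells work on the structured NumPy arrays produced by the parser and insert a zero-vector word instead.

```python
import math
import random

PAD = "Xxxxxx"  # placeholder word, mirroring the 'magic word' idea

def equalize(sentences, rng):
    """Pad short sentences at random positions and cut long ones
    to the rounded-up average length."""
    target = math.ceil(sum(len(s) for s in sentences) / len(sentences))
    result = []
    for words in sentences:
        words = list(words)
        while len(words) < target:
            # randint is inclusive on both ends, so the end of the sentence is allowed
            words.insert(rng.randint(0, len(words)), PAD)
        result.append(words[:target])
    return target, result

toy = [["a", "b"], ["c", "d", "e", "f", "g"], ["h", "i", "j"]]
target, equalized = equalize(toy, random.Random(0))
print(target, equalized)
```

The toy sentences have lengths 2, 5 and 3, so the average of about 3.33 is rounded up to 4: the first and third sentences receive padding and the second loses its last word.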

In [0]:
def equalize_sentence_len(sentences):
  # calculate the average sentence length and round it up (we try to keep most of the sentences)
  avg = math.ceil(count_avg_sentence_len(sentences))
  
  # process 
  for i in range(0, len(sentences['parsed'])):
    sentence = sentences['parsed'][i]
    
    # 'magic words' : text = Xxxxxx ; vector=nullvector ; pos_tag='' ; pos_tag number form : 0
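    # insert magic words at random positions into every sentence that is too short (shorter than the average)
    while len(sentence) < avg:
      # +1 so that an empty sentence does not raise an error and the end of the sentence is also a valid position
      idx = np.random.randint(len(sentence) + 1)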
      sentence = np.insert(sentence, idx, np.array(("Xxxxxx", np.zeros(300), "", 0), dtype=[('text', object, 1), ('vector', object, 1), ('pos_str', object, 1), ('pos_num', object, 1)]), axis=0)
    
    # if sentence is too long cut it
    if len(sentence) > avg:
      sentence = sentence[0:avg]
      
    sentences['parsed'][i] = sentence
      
  return sentences

In [11]:
raw_sentences = load_raw_sentences()
print("Total number of the loaded sentences: " + str(len(raw_sentences)))

Total number of the loaded sentences: 105337


In [17]:
#load only the first 10000 sentences for demonstration (parsing 100000+ sentences would be 10+ minutes)
dataset = parse_raw_sentences(raw_sentences[0:10000], True)

1000 sentences parsed in 6 seconds.
2000 sentences parsed in 12 seconds.
3000 sentences parsed in 18 seconds.
4000 sentences parsed in 25 seconds.
5000 sentences parsed in 31 seconds.
6000 sentences parsed in 37 seconds.
7000 sentences parsed in 43 seconds.
8000 sentences parsed in 49 seconds.
9000 sentences parsed in 55 seconds.
10000 sentences parsed in 62 seconds.
10000 sentences successfully parsed.


In [19]:
avg_before = count_avg_sentence_len(dataset)
print("The average sentence length before the equalization: " + str(avg_before))

# equalize the length of the sentences (we need the number of words to be equal)
dataset = equalize_sentence_len(dataset)

avg_after = count_avg_sentence_len(dataset)
print("And after: " + str(avg_after))

The average sentence length before the equalization: 23.0398
And after: 24.0


In [20]:
print("One sample sentence:\n")
print("Author: " + str(dataset["author"][0]) )
print("Sentence: " + dataset["original"][0])

One sample sentence:

Author: 0
Sentence: Bulk cocoa shipments from West Africa will more than double to 325,000 tonnes in 1996/97, solidifying a cost-cutting trend sparked by recent trial shipments, exporters and shippers said.


In [21]:
print("The parsed form of the above sentence:")
df = pd.DataFrame(data=dataset["parsed"][0])
df.T

The parsed form of the above sentence:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
text,Bulk,cocoa,shipments,from,West,Africa,will,more,than,double,to,325000,tonnes,in,1996/97,solidifying,a,cost,cutting,trend,sparked,by,recent,trial
vector,"[-0.28515, 0.26314, -0.16877, 0.36536, -0.1552...","[0.1582, -0.072933, 0.1475, -0.1833, -0.45731,...","[-0.27237, 0.37706, 0.44835, 0.2931, 0.22239, ...","[0.01332, -0.051085, -0.13207, 0.40386, 0.2113...","[0.22047, 0.023535, 0.61011, -0.17253, 1.132, ...","[-0.44178, 0.14558, 0.47388, -0.41953, 0.52292...","[0.027165, 0.29879, -0.019263, -0.0043049, -0....","[-0.39717, 0.30269, -0.18428, -0.065407, 0.196...","[-0.39611, 0.18991, -0.020033, -0.39995, 0.190...","[-0.10505, 0.15456, -0.25162, 0.017396, 0.1377...","[0.31924, 0.06316, -0.27858, 0.2612, 0.079248,...","[-0.0030893, -0.106, 0.29317, -0.22709, -0.201...","[-0.16118, 0.25736, -0.0994, 0.3978, -0.11761,...","[0.089187, 0.25792, 0.26282, -0.029365, 0.4718...","[-0.045365, -0.43793, 0.39185, 0.22654, 0.2276...","[0.61367, -0.092586, 0.18408, 0.15051, 0.48288...","[0.043798, 0.024779, -0.20937, 0.49745, 0.3601...","[-0.89423, 0.39636, 0.64359, -0.19608, -0.0955...","[-0.31639, 0.61819, 0.18432, -0.51989, 0.06971...","[-0.043012, -0.027765, 0.27702, -0.032487, 0.5...","[-0.1655, 0.14283, 0.50184, 0.54028, 0.089523,...","[-0.15552, -0.33723, -0.097191, -0.21617, -0.3...","[-0.33847, 0.058326, 0.098077, 0.20065, 0.1233...","[-0.20203, 0.12291, -0.045195, -0.0055856, -0...."
pos_str,ADJ,NOUN,NOUN,ADP,PROPN,PROPN,VERB,ADV,ADP,ADJ,ADP,NUM,NOUN,ADP,NUM,VERB,DET,NOUN,VERB,NOUN,VERB,ADP,ADJ,NOUN
pos_num,83,91,91,84,95,95,99,85,84,83,84,92,91,84,92,99,89,91,99,91,99,84,83,91


In [0]:
reworked_dataset = np.empty([len(dataset)], dtype=[('input', object, 1), ('output', object, 1)])
for j in range(0, len(dataset)):
    output_ = dataset[j]['author']
    parsed = dataset[j]['parsed']
    
    # one row per word: the word vector followed by the scaled POS tag id
    # (collect every word; reassigning input_ inside the loop would keep only the last one)
    input_ = np.array([np.append(parsed['vector'][i], parsed['pos_num'][i]/100)
                       for i in range(len(parsed))])
        
    reworked_dataset[j] = (input_, output_)
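
One way to keep every word of a sentence in the network input is to stack the per-word features into a matrix with one row per word. The sketch below shows the resulting shape for an equalized sentence of 24 words with 300-dimensional vectors. The helper `sentence_to_matrix` and the all-zero toy vectors are assumptions made for illustration, not the project's actual data.

```python
import numpy as np

def sentence_to_matrix(vectors, pos_ids, pos_scale=100.0):
    """Stack per-word features: word vector followed by the scaled POS id."""
    vectors = np.asarray(vectors, dtype=float)                        # (n_words, dim)
    pos_column = np.asarray(pos_ids, dtype=float).reshape(-1, 1) / pos_scale
    return np.hstack([vectors, pos_column])                           # (n_words, dim + 1)

toy_vectors = np.zeros((24, 300))   # 24 words with 300-dim vectors (placeholder values)
toy_pos = [91] * 24                 # placeholder POS ids
matrix = sentence_to_matrix(toy_vectors, toy_pos)
print(matrix.shape)
```

Each sentence then becomes a 24 x 301 matrix, which is a shape that recurrent or convolutional layers can consume directly.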

In [0]:
np.random.shuffle(reworked_dataset)

In [0]:
nb_samples = len(reworked_dataset)
valid_split = 0.2
test_split = 0.1

# slice boundaries of the train-valid-test split
train_end = int(nb_samples*(1-valid_split-test_split))
valid_end = int(nb_samples*(1-test_split))

X_train = reworked_dataset['input'][:train_end]
X_valid = reworked_dataset['input'][train_end:valid_end]
X_test  = reworked_dataset['input'][valid_end:]

Y_train = reworked_dataset['output'][:train_end]
Y_valid = reworked_dataset['output'][train_end:valid_end]
Y_test  = reworked_dataset['output'][valid_end:]
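
For the 10000 demonstration sentences and the split ratios used in this cell, the slice boundaries can be worked out with the short sketch below. The helper name `split_bounds` is hypothetical and is used only for this illustration.

```python
def split_bounds(nb_samples, valid_split, test_split):
    """Return the end indices of the training and validation slices."""
    train_end = int(nb_samples * (1 - valid_split - test_split))
    valid_end = int(nb_samples * (1 - test_split))
    return train_end, valid_end

nb_demo = 10000
train_end, valid_end = split_bounds(nb_demo, 0.2, 0.1)
sizes = (train_end, valid_end - train_end, nb_demo - valid_end)
print(sizes)
```

This gives 7000 training, 2000 validation and 1000 test samples, and the three slices together cover the whole shuffled dataset.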

In [25]:
from keras.utils import to_categorical

Y_train = to_categorical(Y_train, 50) # one hot encoding
Y_valid = to_categorical(Y_valid, 50)
Y_test = to_categorical(Y_test, 50)

Using TensorFlow backend.
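
For readers unfamiliar with one-hot encoding, the NumPy sketch below reproduces its effect on a few author ids without requiring Keras. The helper name `one_hot` and the three sample ids are assumptions made for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids into rows with a single 1 at the id's position."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([0, 3, 49], 50)
print(encoded.shape, encoded.argmax(axis=1))
```

Each author id becomes a vector of length 50 with a single nonzero entry, which matches the output layout a 50-way softmax classifier is trained against.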
