# Machine Learning for Author Attribution - Create Dataset

## Genevieve Hayes
### 10th November 2018

In [None]:
# Added by s1280124
# This notebook runs on Google Colaboratory, so grant access to Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Move to the directory for this project
# Change this path to suit your own user and environment
import os
os.chdir('/content/drive/MyDrive/Colab/ENG4')
print(f'Move to:　{os.getcwd()}')

Mounted at /content/drive
Move to:　/content/drive/MyDrive/Colab/ENG4


### Overview


This notebook creates the dataset used in the Analysis notebook. The final output is a CSV file containing sentence-long text excerpts from the works of famous authors, along with labels identifying the author of each excerpt.

To create this dataset, we use novels written by eight classic authors (Louisa May Alcott, Jane Austen, Charlotte Bronte, Wilkie Collins, Arthur Conan Doyle, L.M. Montgomery, Bram Stoker and Mark Twain), all of whom wrote in English during the 19th and early 20th centuries. The novel texts were obtained as text files from [Project Gutenberg](https://www.gutenberg.org/), and chapter/section headings were manually removed from the files prior to processing, since these were not considered to be part of the main text.

To allow a balanced dataset to be created, text excerpts were taken from multiple works for authors whose novels tended to be shorter.

The novels used to create the dataset are as follows:



|Author     | Novels| Genre | Year of Publication|
|---------  |-------|-------|--------------------|
|Louisa May Alcott | *Little Women* |Coming of Age/Romance | 1869 |
|Jane Austen| *Pride and Prejudice* and *Emma*|Romance | 1813/1815 |
|Charlotte Bronte| *Jane Eyre* | Gothic Romance | 1847 |
|Wilkie Collins | *The Woman in White* | Mystery | 1859 |
|Arthur Conan Doyle | *A Study in Scarlet*, *The Sign of the Four* and *The Hound of the Baskervilles*| Mystery |1887/1890/1902| 
|L.M. Montgomery | *Anne of Green Gables* and *Anne of Avonlea* |Coming of Age | 1908/1909 |
|Bram Stoker | *Dracula* | Horror | 1897|
|Mark Twain | *The Adventures of Tom Sawyer* and *The Adventures of Huckleberry Finn*|Coming of Age/Adventure|1876/1884|

### Import Packages

In [None]:
from nltk import tokenize
import numpy as np
import random
import pandas as pd

# Required: download the Punkt data used by the sentence tokenizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load Data and Create Combined Dataset

In creating the sentence lists, we exclude sentences fewer than 5 characters long, as these are unlikely to be proper sentences and are likely too short to contain any useful information. Because the sentence tokenizer has difficulty identifying where a sentence ends in some circumstances (e.g. when the full stop at the end of the sentence sits inside quotation marks), we make some minor adjustments to the text prior to tokenization using the `replace` function.
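
To make the effect of these adjustments concrete, here is a minimal sketch that applies the same `replace` calls used in `split_text` to a single made-up string. The helper name `normalize_punctuation` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
def normalize_punctuation(text):
    """Apply the pre-tokenization adjustments to a string (illustrative helper)."""
    # Move sentence-ending punctuation outside closing quotation marks
    text = text.replace('.”', '”.').replace('."', '".')
    text = text.replace('?”', '”?').replace('!”', '”!')
    # Replace double hyphens with a space, drop spaced ellipses and underscores
    text = text.replace('--', ' ').replace('. . .', '').replace('_', '')
    return text


# Made-up sample text, not taken from any of the novels
sample = '“I am _quite_ sure.” She paused--then smiled.'
print(normalize_punctuation(sample))
# “I am quite sure”. She paused then smiled.
```

After the adjustment, the full stop sits outside the closing quotation mark, which is the form the text above says the tokenizer handles more reliably.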

In [None]:
def split_text(filepath, min_char):
    """Convert text file to a list of sentences.
    
    Args:
    filepath: string. Filepath of text file.
    min_char: int. Minimum number of characters required for a sentence to be
    included.

    Returns:
    sentences: list of strings. List of sentences contained in the text file.
    """
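    # Load data into a string variable and remove new line characters
    with open(filepath, "r", encoding="utf8") as file:
        text = file.read().replace('\n', ' ')

    # Move sentence-ending punctuation outside closing quotation marks so the
    # tokenizer can find sentence boundaries, then strip double hyphens,
    # spaced ellipses and underscores
    text = text.replace('.”', '”.').replace('."', '".').replace('?”', '”?').replace('!”', '”!')
    text = text.replace('--', ' ').replace('. . .', '').replace('_', '')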
    
    # Split text into a list of sentences
    sentences = tokenize.sent_tokenize(text)
    
    # Remove sentences that are less than min_char long
    sentences = [sent for sent in sentences if len(sent) >= min_char]

    return sentences

**Create sentence list for each author**

In [None]:
# Set parameter values
min_char = 5

# Create lists
alcott = split_text('Books/Little_Women.txt', min_char = min_char)
austen = split_text('Books/Pride_and_Prejudice.txt', min_char = min_char)\
         + split_text('Books/Emma.txt', min_char = min_char)
bronte = split_text('Books/Jane_Eyre.txt', min_char = min_char)
collins = split_text('Books/Woman_in_White.txt', min_char = min_char)
doyle = split_text('Books/Study_in_Scarlet.txt', min_char = min_char)\
        + split_text('Books/Sign_of_the_Four.txt', min_char = min_char)\
        + split_text('Books/Hound_of_the_Baskervilles.txt', min_char = min_char)
montgomery = split_text('Books/Anne_of_Green_Gables.txt', min_char = min_char)\
             + split_text('Books/Anne_of_Avonlea.txt', min_char = min_char)
stoker = split_text('Books/Dracula.txt', min_char = min_char)
twain = split_text('Books/Tom_Sawyer.txt', min_char = min_char)\
        + split_text('Books/Huckleberry_Finn.txt', min_char = min_char)

In [None]:
# Print length of each list

text_dict = {'Alcott': alcott, 'Austen': austen, 'Bronte': bronte, 'Collins': collins,
             'Doyle': doyle, 'Montgomery': montgomery, 'Stoker': stoker, 'Twain': twain}

for key, sentences in text_dict.items():
    print(key, ':', len(sentences))

Alcott : 9447
Austen : 14414
Bronte : 9767
Collins : 13520
Doyle : 9421
Montgomery : 12274
Stoker : 8641
Twain : 10712


All lists contain between 8641 and 14414 sentences. So that our final dataset doesn't become skewed towards a single author, we will randomly select 8500 sentences from each list (without replacement) to form the final dataset.

**Select and combine sentences**

In [None]:
# Set random seed
np.random.seed(1)

# Set length parameter
max_len = 8500

# Select sentences
author_lists = [alcott, austen, bronte, collins, doyle, montgomery, stoker, twain]
combined = []

for sentences in author_lists:
    # Draw max_len sentences from this author without replacement
    sample = np.random.choice(sentences, max_len, replace = False)
    combined += list(sample)

print('The length of the combined list is:', len(combined))

The length of the combined list is: 68000


**Create labels list**

In [None]:
# Labels must follow the same author order as the sentence lists combined above
labels = ['Alcott']*max_len + ['Austen']*max_len + ['Bronte']*max_len + ['Collins']*max_len\
         + ['Doyle']*max_len + ['Montgomery']*max_len + ['Stoker']*max_len + ['Twain']*max_len

print('The length of the labels list is:', len(labels))

The length of the labels list is: 68000
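
The labels list is only correct if its author order matches the order in which the sentences were combined. As a hedged alternative, the sketch below builds both lists in one pass from a dictionary, so they cannot drift apart. It is not the notebook's method: the function name `sample_balanced` and the toy data are assumptions, and it swaps `np.random.choice` for the standard library's `random.sample`, so it would select different sentences than the cells above.

```python
import random


def sample_balanced(text_by_author, n_per_author, seed=1):
    """Sample n_per_author sentences per author and build aligned labels.

    Illustrative helper: text_by_author maps an author name to a list of
    sentences, as text_dict does in this notebook.
    """
    rng = random.Random(seed)
    sentences, labels = [], []
    for author, author_sentences in text_by_author.items():
        # random.sample draws without replacement
        sentences += rng.sample(author_sentences, n_per_author)
        labels += [author] * n_per_author
    return sentences, labels


# Toy data standing in for the real sentence lists
toy = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3', 'b4']}
toy_sentences, toy_labels = sample_balanced(toy, n_per_author=2)
print(toy_labels)  # ['A', 'A', 'B', 'B']
```

With the real data, the call would presumably look like `sample_balanced(text_dict, max_len)`.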


**Randomly sort data**

We randomly shuffle the data to avoid any issues arising from the bunching together of sentences by a single author.

In [None]:
# Set random seed
random.seed(3)

# Randomly shuffle data
zipped = list(zip(combined, labels))
random.shuffle(zipped)
combined, labels = zip(*zipped)
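
As a quick sanity check on this pattern, the following sketch runs the same zip, shuffle and unzip steps on made-up toy data and confirms that every sentence keeps its label. The variable names and values here are illustrative assumptions.

```python
import random

# Toy stand-ins for the combined sentences and their labels
texts = ['s0', 's1', 's2', 's3']
authors = ['A', 'A', 'B', 'B']
pairs_before = set(zip(texts, authors))

random.seed(3)
zipped = list(zip(texts, authors))
random.shuffle(zipped)
shuffled_texts, shuffled_authors = zip(*zipped)

# Every (text, author) pair survives the shuffle; only the order changes
assert set(zip(shuffled_texts, shuffled_authors)) == pairs_before

# zip(*...) returns tuples rather than lists
print(type(shuffled_texts).__name__)  # tuple
```

One consequence worth noting is that `combined` and `labels` are tuples after this cell, not lists.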

**Create and export final dataset**

In [None]:
# Create pandas dataframe
out_data = pd.DataFrame()
out_data['text'] = combined
out_data['author'] = labels

print(out_data.head())

                                                text   author
0  I'm afraid I couldn't like him without a spice...   Alcott
1  Yonder was the banks and the islands, across t...    Twain
2  Well, as I was saying about the parlor, there ...    Twain
3  Here, again, the Count had not openly committe...  Collins
4  “No,” assented Tom, “they don't kill the women...    Twain


In [None]:
# Export as a csv file
out_data.to_csv('./data/author_data.csv', index=False)
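
One way to check the exported file is to count the rows per author. The sketch below does this with the standard library only. The helper name `count_authors` and the in-memory demo rows are assumptions for illustration, and the commented-out usage assumes the file path from the export cell above.

```python
import csv
from collections import Counter


def count_authors(lines):
    """Count rows per author in CSV content with 'text' and 'author' columns.

    Illustrative helper: lines can be an open text file or any iterable of
    strings.
    """
    return Counter(row['author'] for row in csv.DictReader(lines))


# Toy check on made-up, in-memory rows
demo = ['text,author', '"Hello, world.",Twain', 'Another one.,Alcott']
print(count_authors(demo))

# For the real file:
# with open('./data/author_data.csv', newline='', encoding='utf8') as f:
#     print(count_authors(f))
```

If the export worked as intended, each of the eight authors should appear 8500 times.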