# 04 Creation of train and test sets

In this notebook we will define a function to split the books into 5 book bundles. We will create the test set from one bundle and the four folds of the training set from the other four.

We have to keep two aspects in mind.
We would like to represent each author equally in both the train and test sets.
Furthermore, to make our task more realistic, as already discussed and as explained below, the test set should come from different books/book groups than the train set.


## Selecting test/training/validation books

Let's import the dataframe summarizing the data on books.

In [1]:
import pandas as pd
import os
no_dupl=pd.read_csv('no_dupl.csv', index_col=0)
no_dupl

number,title,author,length,year,group,chunk_numbers
34204,La petite Fadette,George Sand,331806.0,1869.0,34204,199.0
24850,Lourdes,Émile Zola,1068528.0,1894.0,24850,680.0
41054,Cours familier de Littérature - Volume 07,Alphonse de Lamartine,505709.0,1859.0,41054,315.0
63794,La comédie de celui qui épousa une femme muette,Anatole France,48980.0,1912.0,63794,24.0
12367,"Le péché de Monsieur Antoine, Tome 1",George Sand,559615.0,1845.0,Antoine,353.0
...,...,...,...,...,...,...
13668,Le château des Désertes,George Sand,294819.0,,13668,183.0
37604,Cours familier de Littérature - Volume 10,Alphonse de Lamartine,450970.0,1860.0,37604,270.0
17693,"La San-Felice, Tome 01",Alexandre Dumas,388396.0,1800.0,San-Felice,234.0
7772,Les Quarante-Cinq — Tome 3,Alexandre Dumas,459954.0,,Valois,287.0


During the course, we chose our test set randomly. When we did cross-validation, the folds were also selected randomly. In this case, we have to somewhat restrict the selection.

Indeed, it would be too easy to recognize authors if a test sample was from the same book or same book group as the train sample. In this case the classifier would not only have the style of the author as a clue, but also the names of the characters or the formatting of the book.

If possible, we would like this to hold for the cross-validation too: the training chunks should come from different books/book groups than the validation chunks.


To achieve this, we will define 5 book bundles and use predefined splits for cross-validation on the training set. The test set and each of the four folds of the training set will come from a different book bundle. We try to make sure that each book group (i.e. the volumes of a novel or the books constituting a saga) is contained in a single book bundle.

This way, we avoid the situation where a classifier trains on "The Three Musketeers", sees "Aramis" in it, and then recognizes a chunk from "Twenty Years After", a book from the same saga, simply because "Aramis" appears in that chunk too. All books of this saga will go into the same one of the 4+1 book bundles.
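To make the idea of predefined splits concrete, here is a minimal sketch in plain Python. The fold labels and the helper name `predefined_splits` are made up for illustration; each label says which bundle a training chunk came from, and each bundle is held out once as the validation fold. scikit-learn's `PredefinedSplit` offers a comparable mechanism, but nothing here depends on it.

```python
# Hypothetical fold labels: one entry per training chunk,
# naming the book bundle the chunk was drawn from.
fold_labels = ["valid1", "valid1", "valid2", "valid3", "valid3", "valid4"]


def predefined_splits(labels):
    """Yield (fold, train_idx, valid_idx), holding out one bundle at a time."""
    for fold in sorted(set(labels)):
        valid_idx = [i for i, lab in enumerate(labels) if lab == fold]
        train_idx = [i for i, lab in enumerate(labels) if lab != fold]
        yield fold, train_idx, valid_idx


splits = list(predefined_splits(fold_labels))
for fold, train_idx, valid_idx in splits:
    print(fold, "train:", train_idx, "valid:", valid_idx)
```

Because the folds follow the bundles rather than a random shuffle, a validation chunk never shares a bundle with the chunks used for training in that round.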

How many chunks do we have per author?

In [2]:
no_dupl.groupby("author")["chunk_numbers"].sum().sort_values()

author
Gustave Flaubert          2680.0
Anatole France            3662.0
Guy de Maupassant         3896.0
Marcel Proust             4234.0
Victor Hugo               7006.0
Alphonse de Lamartine     8198.0
Jules Verne              12434.0
Émile Zola               13435.0
George Sand              19777.0
Alexandre Dumas          20019.0
Name: chunk_numbers, dtype: float64

We would like to generate train and test sets that are roughly the same size for each author. Let's try to build the bundles so that we can choose a test set of 300 chunks per author and a training set of 1200 chunks per author (four folds of 300). We will also try to make sure that the test set, like each fold, comes from at least one eighth of the author's books/book groups.
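As a quick sanity check on these sizes, the sketch below compares the per-author totals listed above with the 5 × 300 chunks the bundles need. The variable names are illustrative. Having enough chunks overall is only a necessary condition: whether the book groups can actually be arranged into five bundles is a separate question.

```python
# Chunk totals per author, copied from the output above.
chunks_per_author = {
    "Gustave Flaubert": 2680,
    "Anatole France": 3662,
    "Guy de Maupassant": 3896,
    "Marcel Proust": 4234,
    "Victor Hugo": 7006,
    "Alphonse de Lamartine": 8198,
    "Jules Verne": 12434,
    "Émile Zola": 13435,
    "George Sand": 19777,
    "Alexandre Dumas": 20019,
}

bundles = 5              # one test bundle + four training folds
chunks_per_bundle = 300
needed = bundles * chunks_per_bundle

# Surplus of chunks each author has beyond the required minimum.
headroom = {a: total - needed for a, total in chunks_per_author.items()}
tightest = min(headroom, key=headroom.get)
print(needed, tightest, headroom[tightest])  # 1500 Gustave Flaubert 1180
```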

In [3]:
def attempt_bundles(refs, sizes, min_sums, min_elements):
    """Greedily split refs, in the given order, into len(min_sums) bundles.

    refs is a list of references (book numbers or book group names) and sizes holds
    the corresponding sizes. Bundle k is closed as soon as it contains at least
    min_elements[k] references with an aggregate size of at least min_sums[k].
    Returns the list of bundles, or the string 'no success' if refs run out first.
    """
    bundles_nr=len(min_sums)
    bundles=[]
    bundle_idx=0
    current_bundle=[]
    ref_dict=dict(zip(refs, sizes))
    for i, ref in enumerate(refs):
        current_bundle.append(ref)
        # close the current bundle once it is both big enough and has enough elements
        if (sum(ref_dict[e] for e in current_bundle)>=min_sums[bundle_idx]) and (len(current_bundle)>=min_elements[bundle_idx]):
            bundle_idx+=1
            bundles.append(current_bundle)
            if bundle_idx==bundles_nr:
                # all bundles are filled: the remaining references join the last bundle
                bundles[-1]=current_bundle+refs[i+1:]
                break
            current_bundle=[]
    if bundle_idx==bundles_nr:
        return bundles
    else:
        return 'no success'
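A small worked example may help to see how the greedy fill behaves. The sketch below restates the same logic compactly under the illustrative name `greedy_bundles` so that it runs on its own. The references and sizes are made up.

```python
def greedy_bundles(refs, sizes, min_sums, min_elements):
    """Compact restatement of the greedy fill (illustrative name)."""
    size_of = dict(zip(refs, sizes))
    bundles, current = [], []
    for i, ref in enumerate(refs):
        current.append(ref)
        k = len(bundles)  # index of the bundle currently being filled
        total = sum(size_of[r] for r in current)
        if total >= min_sums[k] and len(current) >= min_elements[k]:
            bundles.append(current)
            current = []
            if len(bundles) == len(min_sums):
                # every bundle is filled: leftovers join the last one
                bundles[-1] = bundles[-1] + refs[i + 1:]
                return bundles
    return "no success"


# Made-up book groups and chunk counts.
refs = ["A", "B", "C", "D", "E"]
sizes = [200, 150, 400, 100, 250]

two = greedy_bundles(refs, sizes, min_sums=[300, 300], min_elements=[1, 1])
four = greedy_bundles(refs, sizes, min_sums=[300] * 4, min_elements=[1] * 4)
print(two)   # [['A', 'B'], ['C', 'D', 'E']]
print(four)  # no success
```

With two bundles, "A" alone (200) is too small, so "B" is added before the first bundle closes. With four bundles the references run out after the third, which is why the caller has to retry with another ordering.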

In [4]:
# This code is neither elegant nor efficient (although it is not a bottleneck), but it gets the job done.
# For each author, we first try to split the books into 4 book bundles for validation and a test book bundle in such a way that
# the bundles respect the book group boundaries (no book group has one book in one bundle and another book in another bundle,
# which would lead to training and validating/testing on chunks from the same book group).

# For this, we try a permutation of the books/book groups and check if the first one is enough for a test book bundle (i.e. has at least 300 chunks).
# If not, we add the second one; if that is still not enough, we also add the third, etc. When the test bundle is ready, we add books in a similar
# way to the first training book bundle, etc.
# If the permutation fails, we try another permutation; after 10 tries we give up.

# When this does not work (for Proust), we relax the condition and only try to make sure that the test bundle has different book groups than the other bundles.



import numpy as np


no_dupl.loc[:,"trainortest"]="None"

np.random.seed(42)
for author, author_df in no_dupl.groupby('author'):
    chunks_per_group=author_df.groupby("group")["chunk_numbers"].sum()
    success=True
    for i in range(10):
        perm=chunks_per_group.loc[np.random.permutation(chunks_per_group.index)]
        output=attempt_bundles(refs=list(perm.index), sizes=list(perm.values), min_sums=[300]*5, min_elements=[len(perm)/8]*5)
        if output!='no success':
            for idx, bundle in enumerate(output):
                no_dupl.loc[no_dupl['group'].isin(bundle), 'trainortest']=['test', 'valid1', 'valid2', 'valid3', 'valid4'][idx] 
            success=True
            break
        else:
            success=False
    if not success:
        print("I cannot select test book bundles and training book bundles with these parameters for {}. The test book bundles will not be from different book groups".format(author))
        for j in range(30):
            perm=chunks_per_group.loc[np.random.permutation(chunks_per_group.index)]
            output_test=attempt_bundles(refs=list(perm.index), sizes=list(perm.values), min_sums=[300, 1200], min_elements=[1, 1]) #test book bundle creation
            #print('test', perm.values, [len(perm)/8, len(perm)/8*4], output_test)
            if output_test!='no success':
                sizes_train=author_df.loc[author_df['group'].isin(output_test[1]), 'chunk_numbers']
                output_train=attempt_bundles(refs=list(sizes_train.index), sizes=list(sizes_train.values), min_sums=[300]*4, min_elements=[len(sizes_train)/8]*4) #train book bundle creation, relaxed version
                #print(output_train)
                if output_train!='no success':
                    no_dupl.loc[no_dupl['group'].isin(output_test[0]), 'trainortest']='test'
                    for idx, bundle in enumerate(output_train):
                        no_dupl.loc[no_dupl.index.isin(bundle), 'trainortest']=['valid1', 'valid2', 'valid3', 'valid4'][idx] 
                    success=True
                    break
            else:
                success=False
        if not success:
            print("I cannot select test and validation books with these parameters for {} at all.".format(author))
            #break
no_dupl
        


I cannot select test book bundles and training book bundles with these parameters for Marcel Proust. The test book bundles will not be from different book groups


number,title,author,length,year,group,chunk_numbers,trainortest
34204,La petite Fadette,George Sand,331806.0,1869.0,34204,199.0,valid4
24850,Lourdes,Émile Zola,1068528.0,1894.0,24850,680.0,valid4
41054,Cours familier de Littérature - Volume 07,Alphonse de Lamartine,505709.0,1859.0,41054,315.0,valid4
63794,La comédie de celui qui épousa une femme muette,Anatole France,48980.0,1912.0,63794,24.0,test
12367,"Le péché de Monsieur Antoine, Tome 1",George Sand,559615.0,1845.0,Antoine,353.0,test
...,...,...,...,...,...,...,...
13668,Le château des Désertes,George Sand,294819.0,,13668,183.0,valid2
37604,Cours familier de Littérature - Volume 10,Alphonse de Lamartine,450970.0,1860.0,37604,270.0,test
17693,"La San-Felice, Tome 01",Alexandre Dumas,388396.0,1800.0,San-Felice,234.0,valid1
7772,Les Quarante-Cinq — Tome 3,Alexandre Dumas,459954.0,,Valois,287.0,valid4


Indeed we now have at least 300 chunks in every book bundle:

In [5]:
(no_dupl.groupby(["author","trainortest"])["chunk_numbers"].sum()<300).sum()

0

And only one book group is in more than one bundle:

In [6]:
for a, b in no_dupl.groupby(['author', 'group'])['trainortest']:
    if len(b.unique())>1:
        print(a, b.unique())

('Marcel Proust', 'Recherche') ['valid1' 'valid2' 'valid3' 'valid4']


While the selection of the books/book groups that form the book bundles (from which the chunks constituting the test set and the different folds of the training set will be drawn) is somewhat random, in some cases the randomness is quite limited. For Proust, the only way to make sure we have enough test chunks is to use the two books outside of "In Search of Lost Time" for testing. This limited randomness is a compromise we have to accept if we want our classification algorithm to rely only on style elements and not on names.

## Creation of train and test sets

Let's create our train and test folders.

In [7]:
for folder in ["valid1", "valid2", "valid3", "valid4", "test"]:
    if not os.path.exists(folder):
        os.mkdir(folder)

Now let's define a function to randomly select a given number of chunks from a selection of an author's books.

In [8]:
import random
random.seed(42)
def dataset_creator(author, dataset_size, trainortest):
    book_numbers=no_dupl[(no_dupl["author"]==author)&(no_dupl["trainortest"]==trainortest)].index
    
    chunks=[]
    chunk_book_numbers=[]
    for n in book_numbers:
        
        filename=os.path.join("chunkified", author, str(n))
        with open(filename, 'r') as f:
            chunksinfile=f.read().split("\n\t\t\n")
            if author=='Alphonse de Lamartine':
                chunksinfile=[chunk for chunk in chunksinfile
                              if ("LAMARTINE" not in chunk) and ("ENTRETIEN" not in chunk) and ("Rouge frères, Dunon et Fresné" not in chunk)]
            chunks+=chunksinfile 
            chunk_book_numbers+=[n]*len(chunksinfile)
    sample=random.sample(range(len(chunks)), dataset_size)
    chunk_sample=[chunks[i] for i in sample]
    book_number_sample=[chunk_book_numbers[i] for i in sample]
    return chunk_sample, book_number_sample
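The core of this function is splitting each file on the chunk separator, remembering which book each chunk came from, and sampling indices without replacement. The sketch below shows these steps on made-up in-memory strings instead of files. The book numbers and chunk texts are hypothetical.

```python
import random

SEPARATOR = "\n\t\t\n"  # chunk delimiter used in the notebook's files

# Made-up file contents standing in for two books by one author.
books = {
    101: SEPARATOR.join(["a1", "a2", "a3"]),
    202: SEPARATOR.join(["b1", "b2"]),
}

chunks, origins = [], []
for number, text in books.items():
    parts = text.split(SEPARATOR)
    chunks += parts
    origins += [number] * len(parts)  # keep track of each chunk's book

rng = random.Random(42)
picked = rng.sample(range(len(chunks)), 3)  # indices, without replacement
sample = [(chunks[i], origins[i]) for i in picked]
print(sample)

# Asking for more chunks than exist raises ValueError.
try:
    rng.sample(range(len(chunks)), 10)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)  # True
```

Sampling indices rather than the chunks themselves is what keeps each sampled chunk aligned with its book number. The `ValueError` case is worth keeping in mind: a bundle must still contain at least `dataset_size` chunks after any filtering.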
        
        

In [9]:
for t in ["valid1", "valid2", "valid3", "valid4", "test"]:
    ds_size=300 # the test set and each training fold have the same size
    all_chunk_book_numbers=pd.DataFrame(columns=no_dupl["author"].unique())
    for a in no_dupl["author"].unique():

        chunks, chunk_book_numbers=dataset_creator(a, ds_size, t)
        all_chunk_book_numbers[a]=chunk_book_numbers
        content="\n\t\t\n".join(chunks)
        filename=os.path.join(t, a)
        if not os.path.isfile(filename):
            with open(filename, 'w') as f:
                f.write(content)
    filename=os.path.join(t, 'chunk_numbers.csv')
    if not os.path.isfile(filename):
        all_chunk_book_numbers.to_csv(filename)
        


In [10]:
no_dupl.to_csv('no_dupl.csv')

## New manuscript or lazy student? - further explanation for the bundles

There are two ways to consider our style classification task:
    <ul>
    <li>as a task for a <b>lazy literary student</b>. Let's say a student has read 80% of the bibliography, i.e. 80% of the chunks. She is then given a chunk and has to determine the author. In this case, using the protagonists' names or a certain formatting choice (e.g. which kind of quotation marks the book has) to recognise the author is fair game. The training set can contain "Les trois mousquetaires" and the test set can contain "Vingt ans après", as we can get points if we can recognize that "Athos" means Dumas. </li>
    <li>as a task to identify the author of a <b>new manuscript</b>. Let's say we have found a page from an unknown book and we want to recognize the author. It is reasonable to assume that the style (including the themes evoked) will be similar, but it will not be a new volume of a known book or a sequel to a saga. We cannot use the names and the formatting. While we only simulate this task and the simulation is far from perfect, this is what we aim for and that is why we try to eliminate formatting and named entity clues. </li>
    </ul>
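
The difference between the two settings can be illustrated with a toy corpus. In the sketch below, the group names, chunk counts and the helper `leaked` are all made up. It counts how many test chunks share a book group with the training set when chunks are shuffled individually ("lazy student") and when a whole group is held out ("new manuscript").

```python
import random

# Made-up corpus: 3 book groups by one author, 10 chunks each.
chunks = [(group, i) for group in ["saga_A", "saga_B", "novel_C"] for i in range(10)]


def leaked(train, test):
    """Count test chunks whose book group also appears in the training set."""
    train_groups = {g for g, _ in train}
    return sum(1 for g, _ in test if g in train_groups)


# "Lazy student": chunks are shuffled individually, so groups straddle the split.
rng = random.Random(0)
shuffled = chunks[:]
rng.shuffle(shuffled)
student_test, student_train = shuffled[:6], shuffled[6:]

# "New manuscript": a whole book group is held out for testing.
manuscript_test = [c for c in chunks if c[0] == "novel_C"]
manuscript_train = [c for c in chunks if c[0] != "novel_C"]

student_leak = leaked(student_train, student_test)
manuscript_leak = leaked(manuscript_train, manuscript_test)
print(student_leak, manuscript_leak)  # 6 0
```

With only 6 test chunks and 10 chunks per group, every group necessarily keeps some chunks in the training set, so all 6 test chunks have a sibling there. Holding out a whole group brings that count to zero.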