# Boostrapping Document Classification
Categorizing documents by style is known to be a difficult problem in Machine Learning and Natural Language Processing. However, every bit of success has the potential to reduce the burden of manual labelling.

In this notebook we'll provide a real world example of boostrapping and iterating with document classification.

## The Problem
* The corpus documents of the Latin Library
* Proposed author/era groupings
    * https://en.wikipedia.org/wiki/Classical_Latin
    * https://en.wikipedia.org/wiki/Latin_literature
    * https://en.wikipedia.org/wiki/Latin

## The Goal
* Provide a mapping of authors-to-eras so CLTK users can easily load commonly used slices of corpus data    
    
## Strategy: Supervised Learning
* Label some documents and authors
* Train a classifier and predict the era of unknown docs
* Compare results with human corrections

Often when the task is to classify a large number of documents it is useful to perform several iterations of document classification. Between iterations, one has the opportunity to assess and improve the classifications using semi-supervised learning.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.simplefilter('ignore') # quiet warnings for presentation purposes only

## Imports

In [2]:
import os
import json
from collections import defaultdict
import random

import joblib
from tqdm import tqdm

from cltk.corpus.readers import FilteredPlaintextCorpusReader, get_corpus_reader, assemble_corpus
from cltk.prosody.latin.string_utils import punctuation_for_spaces_dict
from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.sentence import TokenizeSentence
from cltk.prosody.latin.scansion_constants import ScansionConstants
from cltk.tokenize.word import WordTokenizer
import numpy as np
from sklearn.preprocessing.label import LabelEncoder
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier

### Add parent directory to path so we can access our common code

In [3]:
import sys
import inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 

In [4]:
from mlyoucanuse.doc2tokens_transformer import Doc2TokensTransformer
from mlyoucanuse.corpus_fun import get_file_type_list 

## Load a CorpusReader

In [5]:
reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')
all_docs = list(reader.fileids())
print(f'{len(all_docs):,} documents in the corpus,  e.g.: {all_docs[:5]}...')
print(f'Random sample: {list(reader.docs(random.sample(all_docs, 1)))[0][200:250]}')

2,141 documents in the corpus,  e.g.: ['12tables.txt', '1644.txt', 'abbofloracensis.txt', 'abelard/dialogus.txt', 'abelard/epistola.txt']...
Random sample: ta
qui canat est clamor, sed qui fera nuntiet arma


## Examine the Data
After gathering counts of file types:
* `find ~/cltk_data/latin/text/latin_text_latin_library/*/ -type d  | wc -l`
* `find ~/cltk_data/latin/text/latin_text_latin_library/*.txt -type f | wc -l`

We find the 2,141 corpus files are organized into 79 directories and 537 files in the top level directory.

So for our first pass of labeling we'll try to label directories which will likely include a large number of files.

## Determine Label Categories
after consulting some Wikipedia articles on the subject: 
* https://en.wikipedia.org/wiki/Classical_Latin 
* https://en.wikipedia.org/wiki/Latin_literature

A popular consensus of organizining by era or time periods is apparent.

We will choose these eras:
* augustan
* christian
* early_silver
* late_silver
* medieval
* neo_latin
* old
* renaissance
* republican

## Label some data, but not all 

We've decided to a dictionary of era category types and a List of file instances:
* era 
    * author directory
   
and

* era
    * file name

e.g.:

In [6]:
manual_corpus_directories_by_type = {
    'republican': [
        './caesar',
        './lucretius',
        './nepos',
        './cicero'
    ],
    'augustan': [
        './livy',
        './ovid',
        './horace',
        './vergil',
        './hyginus'
    ],
    'early_silver': [
        './martial',
        './juvenal',
        './tacitus',
        './lucan',
        './quintilian',
        './sen',
        './statius',
        './silius',
        './columella'
    ],
    'late_silver': [
        './suetonius',
        './gellius',
        './apuleius'
        './justin',
        './apicius',
        './fulgentius',
        './orosius'
    ],
    'old': [
        './plautus'
    ],
    'christian': [
        './ambrose',
        './abelard',
        './alcuin',
        './augustine',
        './bede',
        './bible',
        './cassiodorus',
        './commodianus',
        './gregorytours',
        './hugo',
        './isidore',
        './jerome',
        './prudentius',
        './tertullian',
        './kempis',
        './leothegreat'
    ],
    'medieval': [
        './boethiusdacia',
        './dante'
    ],
    'renaissance': [
    ],
    'neo_latin': [
        './addison',
        './bacon',
        './bultelius',
        './descartes',
        './erasmus',
        './galileo',
        './kepler',
        './may',
        './melanchthon',
        './xylander',
        './campion'
    ]
}

#### by text

manual_corpus_texts_by_type = {
    'republican': [
        'sall.1.txt',
        'sall.2.txt',
        'sall.cotta.txt',
        'sall.ep1.txt',
        'sall.ep2.txt',
        'sall.frag.txt',
        'sall.invectiva.txt',
        'sall.lep.txt',
        'sall.macer.txt',
        'sall.mithr.txt',
        'sall.phil.txt',
        'sall.pomp.txt',
        'varro.frag.txt',
        'varro.ll10.txt',
        'varro.ll5.txt',
        'varro.ll6.txt',
        'varro.ll7.txt',
        'varro.ll8.txt',
        'varro.ll9.txt',
        'varro.rr1.txt',
        'varro.rr2.txt',
        'varro.rr3.txt',
        'sulpicia.txt'
    ],
    'augustan': [
        'resgestae.txt',
        'resgestae1.txt',
        'manilius1.txt',
        'manilius2.txt',
        'manilius3.txt',
        'manilius4.txt',
        'manilius5.txt',
        'catullus.txt',
        'vitruvius1.txt',
        'vitruvius10.txt',
        'vitruvius2.txt',
        'vitruvius3.txt',
        'vitruvius4.txt',
        'vitruvius5.txt',
        'vitruvius6.txt',
        'vitruvius7.txt',
        'vitruvius8.txt',
        'vitruvius9.txt',
        'propertius1.txt',
        'tibullus1.txt',
        'tibullus2.txt',
        'tibullus3.txt'
    ],
    'early_silver': [
        'pliny.ep1.txt',
        'pliny.ep10.txt',
        'pliny.ep2.txt',
        'pliny.ep3.txt',
        'pliny.ep4.txt',
        'pliny.ep5.txt',
        'pliny.ep6.txt',
        'pliny.ep7.txt',
        'pliny.ep8.txt',
        'pliny.ep9.txt',
        'pliny.nh1.txt',
        'pliny.nh2.txt',
        'pliny.nh3.txt',
        'pliny.nh4.txt',
        'pliny.nh5.txt',
        'pliny.nhpr.txt',
        'pliny.panegyricus.txt',
        'petronius1.txt',
        'petroniusfrag.txt',
        'persius.txt',
        'phaedr1.txt',
        'phaedr2.txt',
        'phaedr3.txt',
        'phaedr4.txt',
        'phaedr5.txt',
        'phaedrapp.txt',
        'seneca.contr1.txt',
        'seneca.contr10.txt',
        'seneca.contr2.txt',
        'seneca.contr3.txt',
        'seneca.contr4.txt',
        'seneca.contr5.txt',
        'seneca.contr6.txt',
        'seneca.contr7.txt',
        'seneca.contr8.txt',
        'seneca.contr9.txt',
        'seneca.fragmenta.txt',
        'seneca.suasoriae.txt',
        'valeriusflaccus1.txt',
        'valeriusflaccus2.txt',
        'valeriusflaccus3.txt',
        'valeriusflaccus4.txt',
        'valeriusflaccus5.txt',
        'valeriusflaccus6.txt',
        'valeriusflaccus7.txt',
        'valeriusflaccus8.txt',
        'valmax1.txt',
        'valmax2.txt',
        'valmax3.txt',
        'valmax4.txt',
        'valmax5.txt',
        'valmax6.txt',
        'valmax7.txt',
        'valmax8.txt',
        'valmax9.txt',
        'vell1.txt',
        'vell2.txt'
    ],
    'late_silver': [
    ],
    'old': [
        '12tables.txt',
        'ter.adel.txt',
        'ter.andria.txt',
        'ter.eunuchus.txt',
        'ter.heauton.txt',
        'ter.hecyra.txt',
        'ter.phormio.txt',
        'andronicus.txt',
        'enn.txt'
    ],
    'medieval': [
        'anselmepistula.txt',
        'anselmproslogion.txt',
        'carm.bur.txt'
    ],
    'christian': [
        'anon.martyrio.txt',
        'benedict.txt',
        'berengar.txt',
        'bernardclairvaux.txt',
        'bernardcluny.txt',
        'bonaventura.itinerarium.txt',
        'creeds.txt',
        'decretum.txt',
        'diesirae.txt',
        'egeria.txt',
        'ennodius.txt',
        'eucherius.txt',
        'eugippius.txt',
        'greg.txt',
        'gregory.txt',
        'gregory7.txt',
        'hydatius.txt',
        'hymni.txt',
        'innocent.txt',
        'hydatius.txt',
        'junillus.txt',
        'lactantius.txt',
        'liberpontificalis.txt',
        'macarius.txt',
        'macarius1.txt',
        'novatian.txt',
        'papal.txt',
        'paulinus.poemata.txt',
        'perp.txt',
        'professio.txt',
        'prosperus.txt',
        'regula.txt',
        'sedulius.txt',
        'sulpiciusseverus.txt',
        'vorag.txt'
    ],
    'renaissance': [
        'petrarch.ep1.txt',
        'petrarch.numa.txt',
        'petrarch.rom.txt'
    ],
    'neo_latin': [
        'spinoza.ethica1.txt',
        'spinoza.ethica2.txt',
        'spinoza.ethica3.txt',
        'spinoza.ethica4.txt',
        'spinoza.ethica5.txt'
    ]
}



#### The following were directories that weren't obviously divisible into periods based on the web site/file directory layout.
 
	./alanus',
	'./albertanus',
	'./albertofaix',
	'./aquinas',
	'./ammianus',
	'./arnobius',
	'./capellanus',
	'./cato',
	'./claudian',
	'./curtius',
	'./eutropius',
	'./frontinus',
	'./gestafrancorum',
	'./justinian',
	'./lactantius',
	'./martinbraga',
	'./mirandola',
	'./ottofreising',
	'./pauldeacon',
	'./sha',
	'./theodosius',
	'./voragine',
	'./walter',
	'./williamtyre',
 

## Assemble a CorpusReader of our classified docs

In [7]:
reader = assemble_corpus(reader, manual_corpus_texts_by_type.keys(),
                         manual_corpus_directories_by_type, 
                         manual_corpus_texts_by_type )
files_types = get_file_type_list(reader.fileids(), manual_corpus_texts_by_type, manual_corpus_directories_by_type)
docs_already_classified, categories = zip(*files_types)
reader._fileids = docs_already_classified
docs_to_classify = list(set(all_docs) ^ set(docs_already_classified))
print(f'Docs already classified: {len(docs_already_classified):,}')
print(f'Docs to classify: {len(docs_to_classify):,}')

Docs already classified: 1,278
Docs to classify: 863


## Create a LabelEncoder to convert the label names to integers, later on we'll reverse the process

In [8]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform( np.array(categories).ravel())
print(f'Y shape: {y.shape} e.g.: {y}')
print(f'Label encoder classes: {label_encoder.classes_}')

Y shape: (1278,) e.g.: [6 1 1 ... 0 0 5]
Label encoder classes: ['augustan' 'christian' 'early_silver' 'late_silver' 'medieval'
 'neo_latin' 'old' 'renaissance' 'republican']


## Use a custom transformer to prepare the data matrix

In [9]:
dummyX = [
    ['The quick brown fox. The lazy dog.'],
    ['Waiting for godot. Looking for the sunshine.']
]

newX = Doc2TokensTransformer().transform(dummyX)
print(list(newX))

[['The', 'quick', 'brow', 'fox', 'The', 'lazy', 'dog'], ['Waiting', 'for', 'godot', 'Looking', 'for', 'the', 'sunshine']]


In [10]:
X = list(reader.docs(reader.fileids()))
print(f'doc matrix size: {len(X):,}')

doc matrix size: 1,278


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# With a larger labelled data set our test size would be higher, but lower here is okay
# because we just want to see some differences among classifiers
print(f'X_train size: {len(X_train):,}, X_test size: {len(X_test):,}')
print(f'y_train size: {len(y_train):,}, y_test size: {len(y_test):,}')

X_train size: 1,150, X_test size: 128
y_train size: 1,150, y_test size: 128


### We'll use Tfidf Vectorization
Tfidf will help cluster similar words into similar eras.
However it will also tend to cluster documents by subject and theme, which will likely span across eras.
Later we will investigate the use of other approaches.

In [12]:
def train(classifier, X_train, X_test, y_train, y_test):
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier

def identity (data):
    """Identity, as in math; you know, do nothing, just be yourself."""
    return data

trial1 = Pipeline([
    ('normalizer', Doc2TokensTransformer()),
    ('vectorizer', TfidfVectorizer(
        analyzer='word',
        tokenizer=identity,
        preprocessor=identity,
        token_pattern=None)),
    ('classifier', MultinomialNB())
])

train(trial1, X_train, X_test, y_train, y_test)

Accuracy: 0.6953125


Pipeline(memory=None,
     steps=[('normalizer', Doc2TokensTransformer(drop_regexes=[re.compile('[0-9]+[a-zA-Z]'), re.compile('\\s+')],
           language=None,
           valid_chars={'Ȳ', 'ī', 'O', 'w', 'K', 'P', 'Ē', 'j', 'n', 'Ū', 'J', 'T', 'G', 'ū', 'Ā', 'a', 'e', 'A', 'V', 'Ü', 'z', 'ō', 'q', 'X', 'Y', 'C', 'ö', 'ā', '...True, vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

## Train a Dummy classifier to see the baseline we must improve above

In [13]:
dummy = DummyClassifier(strategy='stratified', random_state=0)
features_train, features_test, target_train, target_test = train_test_split(X, y, random_state=0)
dummy.fit(features_train, target_train)
dummy_score = dummy.score(features_test, target_test)
print(f'Dummy classifier: {dummy_score}')

Dummy classifier: 0.175


In [14]:
def model_selection(X_train, y_train, X_test, y_test, estimator):
    """
    Test various estimators.
    """

    model = Pipeline([
        ('normalizer', Doc2TokensTransformer()),
        ('vectorizer', TfidfVectorizer(
            analyzer='word',
            tokenizer=identity,
            preprocessor=identity,
            token_pattern=None)),
        ('classifier', estimator())
    ])
    model.fit(X_train, y_train)
    return ("Accuracy: %s" % model.score(X_test, y_test))


In [15]:
classifiers = [LinearSVC, SVC, KNeighborsClassifier, LogisticRegressionCV, MultinomialNB,
        LogisticRegression, SGDClassifier, BaggingClassifier,
        ExtraTreesClassifier, RandomForestClassifier]

for cls in tqdm(classifiers):
    print(f'Testing {str(cls)}')
    print(model_selection(X_train, y_train, X_test, y_test, cls))
    print(''.ljust(50, '-'))


  0%|          | 0/10 [00:00<?, ?it/s]

Testing <class 'sklearn.svm.classes.LinearSVC'>


 10%|█         | 1/10 [00:02<00:24,  2.75s/it]

Accuracy: 0.6796875
--------------------------------------------------
Testing <class 'sklearn.svm.classes.SVC'>


 20%|██        | 2/10 [00:05<00:21,  2.75s/it]

Accuracy: 0.6796875
--------------------------------------------------
Testing <class 'sklearn.neighbors.classification.KNeighborsClassifier'>


 30%|███       | 3/10 [00:08<00:19,  2.74s/it]

Accuracy: 0.6640625
--------------------------------------------------
Testing <class 'sklearn.linear_model.logistic.LogisticRegressionCV'>


 40%|████      | 4/10 [00:12<00:19,  3.28s/it]

Accuracy: 0.6796875
--------------------------------------------------
Testing <class 'sklearn.naive_bayes.MultinomialNB'>


 50%|█████     | 5/10 [00:15<00:15,  3.13s/it]

Accuracy: 0.6953125
--------------------------------------------------
Testing <class 'sklearn.linear_model.logistic.LogisticRegression'>


 60%|██████    | 6/10 [00:18<00:12,  3.01s/it]

Accuracy: 0.6953125
--------------------------------------------------
Testing <class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>


 70%|███████   | 7/10 [00:20<00:08,  2.93s/it]

Accuracy: 0.5546875
--------------------------------------------------
Testing <class 'sklearn.ensemble.bagging.BaggingClassifier'>


 80%|████████  | 8/10 [00:23<00:05,  2.87s/it]

Accuracy: 0.6796875
--------------------------------------------------
Testing <class 'sklearn.ensemble.forest.ExtraTreesClassifier'>


 90%|█████████ | 9/10 [00:26<00:02,  2.85s/it]

Accuracy: 0.6796875
--------------------------------------------------
Testing <class 'sklearn.ensemble.forest.RandomForestClassifier'>


100%|██████████| 10/10 [00:29<00:00,  2.81s/it]

Accuracy: 0.6796875
--------------------------------------------------





In [42]:
# Train all the data using a reasonable winner
model = Pipeline([
    ('normalizer', Doc2TokensTransformer()),
    ('vectorizer', TfidfVectorizer(
        analyzer='word',
        tokenizer=identity,
        preprocessor=identity,
        token_pattern=None)),
    ('classifier', LogisticRegression())
])

model.fit(X, y)

Pipeline(memory=None,
     steps=[('normalizer', Doc2TokensTransformer(drop_regexes=[re.compile('[0-9]+[a-zA-Z]'), re.compile('\\s+')],
           language=None,
           valid_chars={'Ȳ', 'ī', 'O', 'w', 'K', 'P', 'Ē', 'j', 'n', 'Ū', 'J', 'T', 'G', 'ū', 'Ā', 'a', 'e', 'A', 'V', 'Ü', 'z', 'ō', 'q', 'X', 'Y', 'C', 'ö', 'ā', '...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

In [43]:
unclassified_reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')
unclassified_reader.skip_keywords = None
to_classify_X = list(unclassified_reader.docs(docs_to_classify ))
new_labels = model.predict(to_classify_X)
print (f'Shape new_labels: {new_labels.shape} e.g.: {new_labels[:5]}...')


Shape new_labels: (863,) e.g.: [1 2 8 1 1]...


In [44]:
new_cats = defaultdict(list)
for idx, filename in enumerate(docs_to_classify):
    new_cats[label_encoder.classes_[new_labels[idx]]].append(filename)
print(new_cats)


defaultdict(<class 'list'>, {'christian': ['sha/30.txt', 'ammianus/27.txt', 'justin/32.txt', 'ilias.txt', 'appverg.aetna.txt', 'gaius3.txt', 'theodosius/theod13.txt', 'andreasbergoma.txt', 'voragine/ambro.txt', 'justin/28.txt', 'walter/walter3.txt', 'justin/42.txt', 'albertofaix/hist12.txt', 'capellanus/capellanus3.txt', 'johannes.txt', 'ammianus/18.txt', 'justin/4.txt', 'voragine/vir.txt', 'williamtyre/14.txt', 'ammianus/24.txt', 'voragine/georgio.txt', 'justin/33.txt', 'venantius.txt', 'justin/18.txt', 'williamtyre/3.txt', 'avienus.periegesis.txt', 'vegetius2.txt', 'walter6.txt', 'avienus.ora.txt', 'justin/15.txt', 'theodosius/theod16.txt', 'iamdulcis.txt', 'walter12.txt', 'dumdiane.txt', 'richerus2.txt', 'williamtyre/11.txt', 'justin/30.txt', 'williamtyre/4.txt', 'valesianus2.txt', 'ammianus/23.txt', 'theodosius/theod04.txt', 'archpoet.txt', 'withof3.txt', 'smarius.txt', 'voragine/jud.txt', 'justin/13.txt', 'justinian/institutes.proem.txt', 'williamtyre/22.txt', 'sha/firmus.txt', 'v

In [45]:
with open('new_cats.json', mode='w', encoding='utf8') as writer:
    json.dump(new_cats, writer, indent=2)


In [46]:
with open('latin_text_classifier.mdl.pkl', 'wb') as writer:
    joblib.dump(new_cats, writer)


In [47]:
predicted_files_cats = [ (item, key ) 
               for key, value in new_cats.items()
              for item in value               ]
predicted_files_cats.sort(key=lambda x: x[0])
print(len(predicted_files_cats))
random.sample(predicted_files_cats, 10)

863


[('falcone.txt', 'republican'),
 ('apuleius/apuleius8.txt', 'early_silver'),
 ('landor1806.txt', 'early_silver'),
 ('apuleius/apuleius2.txt', 'early_silver'),
 ('frontinus/strat2.txt', 'late_silver'),
 ('ottofreising/epistola.txt', 'augustan'),
 ('columba2.txt', 'augustan'),
 ('albertofaix/hist9.txt', 'christian'),
 ('sha/aurel.txt', 'christian'),
 ('nemesianus4.txt', 'republican')]

# Vetting the data
The mapping files and directories to eras have been manually corrected outside of this notebook, so let's use them to compare how our classifier performed.

In [48]:
# Load the human corrected labellings, they have been put into the CLTK library
from cltk.corpus.latin.latin_library_corpus_types import corpus_directories_by_type
from cltk.corpus.latin.latin_library_corpus_types import corpus_texts_by_type 

reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')
human_corrected_labels = get_file_type_list(reader.fileids(), 
                                            corpus_texts_by_type, 
                                            corpus_directories_by_type)
print(f'Number of human corrected labels {len(human_corrected_labels):,} Sample:')
human_files, human_labels = zip(*human_corrected_labels)
human_corrected_labels[:10]


Number of human corrected labels 1,747 Sample:


[('12tables.txt', 'early'),
 ('abelard/dialogus.txt', 'christian'),
 ('abelard/epistola.txt', 'christian'),
 ('abelard/historia.txt', 'christian'),
 ('addison/barometri.txt', 'neo_latin'),
 ('addison/burnett.txt', 'neo_latin'),
 ('addison/hannes.txt', 'neo_latin'),
 ('addison/machinae.txt', 'neo_latin'),
 ('addison/pax.txt', 'neo_latin'),
 ('addison/praelium.txt', 'neo_latin')]

### Note
The number of human corrected labels does not match the total number of files in the corpus, 
because if a file couldn't be reasonably matched to era, then a categorization wasn't posited. 
Often it's better to be clear about what is uncertain, rather than to push a false sense of confidence.

In [49]:
    
num_good_predictions= sum([1 for prediction in predicted_files_cats
    if prediction in human_corrected_labels])    
print(f'Number of good predictions: {num_good_predictions}')    


for prediction in predicted_files_cats:
        if prediction in human_corrected_labels:
            print(prediction)


Number of good predictions: 1
('apuleius/apuleius.florida.txt', 'late_silver')


In [50]:
human_corrected_labels

[('12tables.txt', 'early'),
 ('abelard/dialogus.txt', 'christian'),
 ('abelard/epistola.txt', 'christian'),
 ('abelard/historia.txt', 'christian'),
 ('addison/barometri.txt', 'neo_latin'),
 ('addison/burnett.txt', 'neo_latin'),
 ('addison/hannes.txt', 'neo_latin'),
 ('addison/machinae.txt', 'neo_latin'),
 ('addison/pax.txt', 'neo_latin'),
 ('addison/praelium.txt', 'neo_latin'),
 ('addison/preface.txt', 'neo_latin'),
 ('addison/resurr.txt', 'neo_latin'),
 ('addison/sphaer.txt', 'neo_latin'),
 ('alanus/alanus1.txt', 'misc'),
 ('alanus/alanus2.txt', 'misc'),
 ('albertanus/albertanus.arsloquendi.txt', 'misc'),
 ('albertanus/albertanus.liberconsol.txt', 'misc'),
 ('albertanus/albertanus.sermo.txt', 'misc'),
 ('albertanus/albertanus.sermo1.txt', 'misc'),
 ('albertanus/albertanus.sermo2.txt', 'misc'),
 ('albertanus/albertanus.sermo3.txt', 'misc'),
 ('albertanus/albertanus.sermo4.txt', 'misc'),
 ('albertanus/albertanus1.txt', 'misc'),
 ('albertanus/albertanus2.txt', 'misc'),
 ('albertanus/albe

In [51]:
predicted_files_cats

[('1644.txt', 'republican'),
 ('abbofloracensis.txt', 'christian'),
 ('adso.txt', 'christian'),
 ('aelredus.txt', 'christian'),
 ('agnes.txt', 'neo_latin'),
 ('alanus/alanus1.txt', 'christian'),
 ('alanus/alanus2.txt', 'christian'),
 ('albertanus/albertanus.arsloquendi.txt', 'christian'),
 ('albertanus/albertanus.liberconsol.txt', 'christian'),
 ('albertanus/albertanus.sermo.txt', 'christian'),
 ('albertanus/albertanus.sermo1.txt', 'christian'),
 ('albertanus/albertanus.sermo2.txt', 'christian'),
 ('albertanus/albertanus.sermo3.txt', 'christian'),
 ('albertanus/albertanus.sermo4.txt', 'christian'),
 ('albertanus/albertanus1.txt', 'christian'),
 ('albertanus/albertanus2.txt', 'christian'),
 ('albertanus/albertanus3.txt', 'christian'),
 ('albertanus/albertanus4.txt', 'christian'),
 ('albertofaix/hist1.txt', 'christian'),
 ('albertofaix/hist10.txt', 'christian'),
 ('albertofaix/hist11.txt', 'christian'),
 ('albertofaix/hist12.txt', 'christian'),
 ('albertofaix/hist2.txt', 'christian'),
 (

## Conclusion:
Using tfidf is insufficient for clustering a single author, and across authors the shared signal is not high enough to place them in multi-doc era classification. We will experiment with further feature selection.
#### To be continued...