# Introduction

This notebook aims to classify text by author. The data comes from the Gutenberg corpus shipped with NLTK. Ten text files are split into sentences, each sentence is labeled with its author, and the result is stored in a DataFrame.

I will first cluster these sentences and check whether a clustering algorithm can separate the texts by author. Then I will build a few models to predict the author from a sentence.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.cluster import KMeans, MeanShift, SpectralClustering
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score, confusion_matrix
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
import nltk
from keras.layers import LSTM, Dense, Embedding, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences

nltk.download('gutenberg')

Using TensorFlow backend.
[nltk_data] Error loading gutenberg: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [2]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [3]:
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Drop anything enclosed in square brackets (raw string avoids invalid escapes).
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
poems = gutenberg.raw('blake-poems.txt')
stories = gutenberg.raw('bryant-stories.txt')
busterbrown = gutenberg.raw('burgess-busterbrown.txt')
alice = gutenberg.raw('carroll-alice.txt')
ball = gutenberg.raw('chesterton-ball.txt')
parents = gutenberg.raw('edgeworth-parents.txt')
moby_dick = gutenberg.raw('melville-moby_dick.txt')
paradise = gutenberg.raw('milton-paradise.txt')
hamlet = gutenberg.raw('shakespeare-hamlet.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
poems = re.sub(r'CHAPTER .*', '', poems)
stories = re.sub(r'CHAPTER .*', '', stories)
busterbrown = re.sub(r'CHAPTER .*', '', busterbrown)
ball = re.sub(r'CHAPTER .*', '', ball)
parents = re.sub(r'CHAPTER .*', '', parents)
moby_dick = re.sub(r'CHAPTER .*', '', moby_dick)
paradise = re.sub(r'CHAPTER .*', '', paradise)
hamlet = re.sub(r'CHAPTER .*', '', hamlet)

In [4]:
# Keep only the first tenth of each text to reduce the dataset size

alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])
poems = text_cleaner(poems[:int(len(poems)/10)])
stories = text_cleaner(stories[:int(len(stories)/10)])
busterbrown = text_cleaner(busterbrown[:int(len(busterbrown)/10)])
ball = text_cleaner(ball[:int(len(ball)/10)])
parents = text_cleaner(parents[:int(len(parents)/10)])
moby_dick = text_cleaner(moby_dick[:int(len(moby_dick)/10)])
paradise = text_cleaner(paradise[:int(len(paradise)/10)])
hamlet = text_cleaner(hamlet[:int(len(hamlet)/10)])

In [5]:
nltk.sent_tokenize(alice)

["Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'",
 'So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!',
 "I shall be late!'",
 '(when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and

In [6]:
# Split each text into sentences and pair every sentence with its author label

alice_sents = [[sent, "Carroll"] for sent in nltk.sent_tokenize(alice)]
persuasion_sents = [[sent, "Austen"] for sent in nltk.sent_tokenize(persuasion)]
poems_sents = [[sent, "Blake"] for sent in nltk.sent_tokenize(poems)]
stories_sents = [[sent, "Bryant"] for sent in nltk.sent_tokenize(stories)]
busterbrown_sents = [[sent, "Burgess"] for sent in nltk.sent_tokenize(busterbrown)]
ball_sents = [[sent, "Chesterton"] for sent in nltk.sent_tokenize(ball)]
parents_sents = [[sent, "Edgeworth"] for sent in nltk.sent_tokenize(parents)]
moby_dick_sents = [[sent, "Melville"] for sent in nltk.sent_tokenize(moby_dick)]
paradise_sents = [[sent, "Milton"] for sent in nltk.sent_tokenize(paradise)]
hamlet_sents = [[sent, "Shakespeare"] for sent in nltk.sent_tokenize(hamlet)]

In [7]:
sentences = pd.DataFrame(alice_sents + persuasion_sents + poems_sents + stories_sents + busterbrown_sents + ball_sents + parents_sents + moby_dick_sents + paradise_sents + hamlet_sents)
sentences.head()

,0,1
0,Alice was beginning to get very tired of sitti...,Carroll
1,So she was considering in her own mind (as wel...,Carroll
2,There was nothing so VERY remarkable in that; ...,Carroll
3,Oh dear!,Carroll
4,I shall be late!',Carroll


# Creating clusters

## Decomposition

To reduce the number of features we need a decomposition technique. PCA is a poor fit for sparse text vectors, so we'll compare two techniques that are commonly used to reduce text features:

1. TruncatedSVD, also known as LSA (Latent Semantic Analysis)
2. LDA (Latent Dirichlet Allocation)
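
As a minimal sketch of the first technique, the snippet below runs `TruncatedSVD` directly on a sparse tf-idf matrix. Unlike PCA, `TruncatedSVD` does not center the data, so the sparse input never has to be converted with `.toarray()`. The toy corpus and the choice of `n_components=2` are assumptions for illustration only, not the notebook's data or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not the Gutenberg sentences).
docs = [
    "the whale swam in the sea",
    "the ship sailed on the sea",
    "alice followed the white rabbit",
    "the rabbit looked at his watch",
]

# fit_transform returns a scipy sparse matrix: one row per document.
tfidf = TfidfVectorizer().fit_transform(docs)

# Reduce the sparse matrix to 2 dense components without densifying it first.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_)
```

On the real data the same call works on `cluster_vectorizer.fit_transform(X_train)` as is, which avoids holding a dense sentence-by-vocabulary array in memory.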

In [8]:
X = sentences[0]
Y = sentences[1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=44)

In [9]:
# making a tfidf vector for converting text data into numerical form
cluster_vectorizer = TfidfVectorizer()
X_cluster = cluster_vectorizer.fit_transform(X_train).toarray()
Y_cluster = Y_train

### LDA

In [10]:
lda = LatentDirichletAllocation()
x_lda = lda.fit_transform(X_cluster)

In [11]:
kmeans_lda = KMeans(n_clusters=10)
kmeans_lda.fit(x_lda)
y_predict = kmeans_lda.predict(x_lda)
print(silhouette_score(x_lda, kmeans_lda.labels_))
pd.crosstab(Y_cluster, y_predict)

0.5459476816969323


Author,0,1,2,3,4,5,6,7,8,9
Austen,1,149,11,9,5,10,4,8,4,3
Blake,0,8,2,1,2,4,3,0,2,1
Bryant,6,110,17,67,21,6,4,4,7,9
Burgess,2,42,11,12,7,3,0,4,4,1
Carroll,2,61,11,6,6,4,0,2,1,3
Chesterton,23,116,39,37,21,23,25,22,14,10
Edgeworth,13,379,52,41,24,18,10,11,19,8
Melville,61,239,75,98,79,53,79,51,41,37
Milton,15,20,21,21,9,12,20,11,13,13
Shakespeare,7,29,32,14,7,5,18,20,9,17


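The silhouette score for LDA (about 0.55) is reasonable, although the crosstab shows that cluster 1 takes the largest share of sentences for most authors, so the clusters do not line up with authorship. Now let's run Mean Shift on the same data to see how many clusters it chooses.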
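
A silhouette score only measures how compact and well separated the clusters are; it does not use the author labels at all. The sketch below, on made-up 2-D points, shows a clustering with a high silhouette score that still carries no information about the labels. `adjusted_rand_score` is used here as one possible label-aware measure; it is an addition for illustration and is not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two tight, well-separated groups of points (hypothetical data).
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
])
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# "Authors" are spread evenly over both groups, so clusters cannot recover them.
authors = np.array(["A", "B", "A", "B", "A", "B", "A", "B"])

sil = silhouette_score(X, clusters)           # geometry only
ari = adjusted_rand_score(authors, clusters)  # agreement with the labels

print("silhouette:", round(sil, 3))
print("adjusted Rand index:", round(ari, 3))
```

Reading the silhouette score together with the crosstab (or a label-aware score like this one) gives a fuller picture than the silhouette score alone.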

In [12]:
meanshift_lda = MeanShift()
meanshift_lda.fit(x_lda)
y_pred = meanshift_lda.predict(x_lda)
print(silhouette_score(x_lda, meanshift_lda.labels_))
pd.crosstab(Y_cluster, y_pred)

0.38160194199803305


Author,0,1,2
Austen,180,11,13
Blake,20,1,2
Bryant,161,73,17
Burgess,62,12,12
Carroll,77,7,12
Chesterton,248,41,41
Edgeworth,474,46,55
Melville,621,106,86
Milton,112,22,21
Shakespeare,105,16,37


Mean Shift does not require the number of clusters to be specified; it chooses the number automatically and fits the model accordingly. We only use this method when the dataset is small, because it takes a long time to train.

On the LDA features MeanShift gives a reasonable silhouette score, but the crosstab shows that it found only 3 clusters. Since we know the data contains 10 authors, that makes it less useful here than k-means with 10 clusters.
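
The number of clusters Mean Shift finds is driven by its `bandwidth`. When `bandwidth` is not given, as in the cell above, scikit-learn estimates it from the data. The sketch below sets it explicitly on made-up points so that the behaviour is easy to follow; the value `2.0` and the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two small 3x3 grids of points, one around (0, 0) and one around (10, 10).
offsets = np.array([(dx, dy) for dx in (-0.2, 0.0, 0.2) for dy in (-0.2, 0.0, 0.2)])
X = np.vstack([offsets, offsets + 10.0])

# The bandwidth covers a whole grid but not the gap between the grids,
# so every point drifts to the centre of its own grid.
ms = MeanShift(bandwidth=2.0).fit(X)
n_clusters = len(ms.cluster_centers_)

print("clusters found:", n_clusters)
print("centres:\n", ms.cluster_centers_)
```

A larger bandwidth merges groups and a smaller one splits them, which is one way to probe whether the 3 clusters found on `x_lda` are stable.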

### TruncatedSVD

In [13]:
svd = TruncatedSVD(n_components=100, random_state=40)
x_svd = svd.fit_transform(X_cluster)

In [14]:
kmeans_svd = KMeans(n_clusters=10)
kmeans_svd.fit(x_svd)
y_predict = kmeans_svd.predict(x_svd)
print(silhouette_score(x_svd, kmeans_svd.labels_))
pd.crosstab(Y_cluster, y_predict)

0.023762496288718387


Author,0,1,2,3,4,5,6,7,8,9
Austen,4,4,4,0,35,22,33,52,0,50
Blake,0,0,0,0,0,0,5,10,0,8
Bryant,40,23,2,16,22,4,20,51,16,57
Burgess,3,1,5,0,0,13,28,24,0,12
Carroll,2,7,4,0,28,11,1,38,0,5
Chesterton,17,21,10,0,0,33,62,64,0,123
Edgeworth,50,74,26,1,82,39,116,111,0,76
Melville,9,25,22,0,7,63,88,331,1,267
Milton,0,0,3,0,0,0,19,59,0,74
Shakespeare,0,11,9,0,2,21,2,96,0,17


In [15]:
meanshift_svd = MeanShift()
meanshift_svd.fit(x_svd)
print(silhouette_score(x_svd, meanshift_svd.labels_))
y_pred = meanshift_svd.predict(x_svd)
pd.crosstab(Y_cluster, y_pred)

0.15018672629583116


Author,0,1,2,3,4,5,6,7,8,9,...,43,44,45,46,47,48,49,50,51,52
Austen,203,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Blake,23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bryant,187,9,8,6,6,5,3,4,0,0,...,0,0,0,0,1,0,0,0,2,0
Burgess,84,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Carroll,91,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
Chesterton,320,0,0,0,0,0,0,1,0,2,...,0,0,0,0,0,0,0,0,0,1
Edgeworth,529,0,0,0,0,0,0,0,2,0,...,2,2,2,1,0,1,0,1,0,1
Melville,801,0,0,0,0,1,1,0,0,0,...,0,1,0,0,1,0,1,0,0,2
Milton,154,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Shakespeare,154,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,1


KMeans and MeanShift both performed very poorly on the TruncatedSVD features. LDA is often reported to be more reliable than LSA for this kind of task, which is consistent with the scores observed here.

In [16]:
sclustering = SpectralClustering(n_clusters=10, random_state=40)
y_pred = sclustering.fit_predict(x_lda)
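# Score on x_lda, the features the clusters were fit on. The recorded output
# below was produced by scoring against x_svd, so it is not a valid silhouette.
print(silhouette_score(x_lda, sclustering.labels_))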
pd.crosstab(Y_cluster, y_pred)

-0.10368864399299942


Author,0,1,2,3,4,5,6,7,8,9
Austen,7,7,2,161,5,1,5,7,6,3
Blake,1,1,3,9,2,0,2,4,0,1
Bryant,61,16,1,125,20,5,7,6,3,7
Burgess,10,9,0,51,5,1,4,3,2,1
Carroll,5,10,0,65,5,2,1,4,2,2
Chesterton,33,34,19,140,16,22,14,21,20,11
Edgeworth,31,40,8,421,18,13,17,14,7,6
Melville,91,68,74,297,69,52,42,48,41,31
Milton,19,19,17,32,9,13,14,10,9,13
Shakespeare,14,30,16,38,6,7,9,5,17,16


SpectralClustering does a very poor job: it places the largest share of every author's sentences in a single cluster (cluster 3), so we won't look into it further.

In [17]:
sentences[1].value_counts()

Melville       1090
Edgeworth       778
Chesterton      439
Bryant          327
Austen          280
Milton          210
Shakespeare     210
Carroll         118
Burgess         108
Blake            28
Name: 1, dtype: int64

## Vectorizing methods

We will try TfidfVectorizer and CountVectorizer to convert the text into numerical form, then compare the models built on each to see which gives the higher accuracy.
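
To make the difference between the two vectorizers concrete, the sketch below applies both to a two-sentence toy corpus (an assumption for illustration). `CountVectorizer` stores raw word counts, while `TfidfVectorizer` with its default settings reweights the counts and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not the Gutenberg sentences).
docs = ["the cat sat on the mat", "the dog barked"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each word to its column index.
the_col = count_vec.vocabulary_["the"]
the_count = counts[0, the_col]

# With the default norm="l2", every tf-idf row has Euclidean length 1.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print("count of 'the' in first sentence:", the_count)
print("tf-idf weight of 'the' in first sentence:", weights[0, tfidf_vec.vocabulary_["the"]])
print("tf-idf row norms:", row_norms)
```

Both vectorizers build the same vocabulary from the same text, so the two matrices have the same shape and differ only in the values they hold.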

In [18]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X_train)

count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(X_train)

# Supervised Modeling

Now let's build some models using the labels available to us. We'll try three supervised models, RandomForestClassifier, GradientBoostingClassifier and LogisticRegression, and see which works best.
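
One way to organise this kind of comparison is to wrap the vectorizer and the classifier in a single scikit-learn pipeline, so that the vocabulary is always learned from the training portion only. The sketch below is a possible setup, not the approach used in the cells that follow; the toy sentences, the labels, `max_iter=1000` and `cv=2` are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labeled sentences (hypothetical data).
texts = [
    "the whale rose from the sea",
    "the ship chased the white whale",
    "the harpoon struck the whale",
    "the crew sailed the stormy sea",
    "alice fell down the rabbit hole",
    "the rabbit checked his pocket watch",
    "alice drank from the little bottle",
    "the queen shouted at alice",
]
authors = ["Melville"] * 4 + ["Carroll"] * 4

# Vectorizer and classifier are fitted together on each training fold.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# 2-fold cross-validation returns one accuracy score per fold.
cv_scores = cross_val_score(clf, texts, authors, cv=2)

clf.fit(texts, authors)
pred = clf.predict(["the whale and the ship"])

print("fold accuracies:", cv_scores)
print("prediction:", pred[0])
```

Swapping `CountVectorizer()` for `TfidfVectorizer()`, or the classifier for another estimator, changes one line and keeps the evaluation identical.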

### RandomForestClassifier

In [19]:
# random forest with tfidf vectorizer
rfc_tfidf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=30)
rfc_tfidf.fit(X_tfidf, Y_train)
print("Tfidf Train score: ", rfc_tfidf.score(X_tfidf, Y_train))

# random forest with count vectorizer
rfc_count = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=30)
rfc_count.fit(X_count, Y_train)
print("Count Train score: ", rfc_count.score(X_count, Y_train))

Tfidf Train score:  0.30992196209587514
Count Train score:  0.3110367892976589


### GradientBoostingClassifier

In [20]:
# Gradient boosting with tfidf vectorizer
gbc_tfidf = GradientBoostingClassifier()
gbc_tfidf.fit(X_tfidf, Y_train)
print("Tfidf Train score: ", gbc_tfidf.score(X_tfidf, Y_train))

# Gradient boosting with count vectorizer
gbc_count = GradientBoostingClassifier()
gbc_count.fit(X_count, Y_train)
print("Count Train score: ", gbc_count.score(X_count, Y_train))

Tfidf Train score:  0.9126718691936083
Count Train score:  0.872166480862133


### LogisticRegression

In [21]:
# Logistic Regression with tfidf Vectorizer
lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_tfidf, Y_train)
print("Tfidf Train score: ", lr_tfidf.score(X_tfidf, Y_train))

# Logistic Regression with count vectorizer
lr_count = LogisticRegression()
lr_count.fit(X_count, Y_train)
print("Count Train score: ", lr_count.score(X_count, Y_train))



Tfidf Train score:  0.7959866220735786
Count Train score:  0.9609810479375697


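The best supervised model is Logistic Regression with CountVectorizer, which reaches about 96% accuracy on the training set and 73% on the test set (test scores are reported in the Test data validation section below). CountVectorizer gives the higher test score for every model. The large gap between train and test accuracy for Gradient Boosting and Logistic Regression indicates overfitting.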

# Unsupervised modeling

#### Count vectorizer

In [22]:
lda = LatentDirichletAllocation()
x_lda = lda.fit_transform(X_count)

In [23]:
x_lda

array([[0.2243359 , 0.00333403, 0.00333405, ..., 0.00333392, 0.00333389,
        0.00333383],
       [0.01111248, 0.01111241, 0.01111294, ..., 0.01111313, 0.89997469,
        0.01111278],
       [0.00909287, 0.39039528, 0.0090923 , ..., 0.53686077, 0.00909226,
        0.00909284],
       ...,
       [0.0100034 , 0.01000266, 0.01000101, ..., 0.01000144, 0.01000275,
        0.01000211],
       [0.00357248, 0.0035723 , 0.00357176, ..., 0.00357197, 0.00357224,
        0.00357249],
       [0.00250052, 0.47845943, 0.50153684, ..., 0.00250039, 0.00250046,
        0.00250058]])

In [24]:
X_train_count = pad_sequences(x_lda)

In [25]:
X_count.shape

(2691, 7758)

In [26]:
model = Sequential()
model.add(Embedding(2600, 128,input_length = X_train_count.shape[1]))
model.add(SpatialDropout1D(0.4))
# dropout / recurrent_dropout are the Keras 2 names for dropout_W / dropout_U
model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2, return_sequences=False))
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 10, 128)           332800    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 10, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               394240    
_________________________________________________________________
dense_1 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
Total params: 761,226
Trainable params: 761,226
Non-trainable params: 0
_________________________________________________________________


In [27]:
Y_train.value_counts()

Melville       813
Edgeworth      575
Chesterton     330
Bryant         251
Austen         204
Shakespeare    158
Milton         155
Carroll         96
Burgess         86
Blake           23
Name: 1, dtype: int64

In [28]:
y_train = pd.get_dummies(Y_train)

In [29]:
print(y_train)

      Austen  Blake  Bryant  Burgess  Carroll  Chesterton  Edgeworth  \
1390       0      0       0        0        0           0          1   
2852       0      0       0        0        0           0          0   
1669       0      0       0        0        0           0          1   
1430       0      0       0        0        0           0          1   
944        0      0       0        0        0           1          0   
3225       0      0       0        0        0           0          0   
946        0      0       0        0        0           1          0   
1502       0      0       0        0        0           0          1   
1391       0      0       0        0        0           0          1   
3152       0      0       0        0        0           0          0   
212        1      0       0        0        0           0          0   
382        1      0       0        0        0           0          0   
2803       0      0       0        0        0           0       

In [30]:
print(X_train_count.shape, y_train.shape)

(2691, 10) (2691, 10)


In [31]:
history = model.fit(X_train_count, y_train, epochs=5, batch_size=64, verbose=2)

Epoch 1/5
 - 2s - loss: 0.2953 - acc: 0.8989
Epoch 2/5
 - 1s - loss: 0.2885 - acc: 0.9000
Epoch 3/5
 - 1s - loss: 0.2881 - acc: 0.9000
Epoch 4/5
 - 1s - loss: 0.2881 - acc: 0.9000
Epoch 5/5
 - 1s - loss: 0.2884 - acc: 0.9000


In [32]:
y_predict_train = model.predict_classes(X_train_count)
print(y_predict_train.shape)

(2691,)


#### Tfidf Vectorizer

In [36]:
lda = LatentDirichletAllocation()
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
x_lda = lda.fit_transform(X_train_tfidf)
x_test_tfidf = lda.transform(X_test_tfidf)

In [37]:
X_train_tfidf = pad_sequences(x_lda)

In [39]:
print(X_train_tfidf.shape, Y_train.shape)
print(x_test_tfidf.shape, Y_test.shape)

(2691, 10) (2691,)
(897, 10) (897,)


In [41]:
history = model.fit(X_train_tfidf, y_train, epochs=10, batch_size=5, verbose=2)

Epoch 1/10
 - 10s - loss: 0.2899 - acc: 0.9000
Epoch 2/10
 - 10s - loss: 0.2887 - acc: 0.9000
Epoch 3/10
 - 9s - loss: 0.2885 - acc: 0.9000
Epoch 4/10
 - 9s - loss: 0.2881 - acc: 0.9000
Epoch 5/10
 - 9s - loss: 0.2880 - acc: 0.9000
Epoch 6/10
 - 10s - loss: 0.2881 - acc: 0.9000
Epoch 7/10
 - 9s - loss: 0.2880 - acc: 0.9000
Epoch 8/10
 - 9s - loss: 0.2878 - acc: 0.9000
Epoch 9/10
 - 9s - loss: 0.2879 - acc: 0.9000
Epoch 10/10
 - 9s - loss: 0.2877 - acc: 0.9000


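It looks like the model has 90% accuracy, but the value stays constant from the first epoch to the last, which indicates that something is wrong. Two likely causes: `binary_crossentropy` with a 10-way one-hot target scores each of the 10 outputs separately, so a model that never commits to any class is already "right" on 9 of 10 outputs; and `pad_sequences` casts its input to integers by default, which would turn the LDA topic weights into zeros before they reach the network.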
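
Two effects that could explain a constant 90% can be reproduced with NumPy alone. The sketch assumes that `pad_sequences` keeps its default integer dtype and that the reported accuracy is a per-output binary accuracy with a 0.5 threshold; the arrays are made up and no Keras code is run.

```python
import numpy as np

# 1) Topic weights are floats in [0, 1); an int32 cast truncates them to 0.
topic_weights = np.array([[0.01, 0.90, 0.09],
                          [0.39, 0.07, 0.54]])
as_int = topic_weights.astype("int32")
print("after int32 cast:\n", as_int)

# 2) Per-output binary accuracy on 10-class one-hot targets.
n_samples, n_classes = 1000, 10
rng = np.random.RandomState(0)
labels = rng.randint(0, n_classes, size=n_samples)
one_hot = np.eye(n_classes)[labels]

# A "model" that outputs the same flat softmax for every sample.
flat_pred = np.full((n_samples, n_classes), 1.0 / n_classes)

# Every output is below 0.5, so all ten are rounded to 0: nine of ten match.
binary_acc = np.mean((flat_pred > 0.5) == (one_hot > 0.5))

# Accuracy in the usual sense: does the arg-max class equal the label?
categorical_acc = np.mean(flat_pred.argmax(axis=1) == labels)

print("binary accuracy:", binary_acc)
print("categorical accuracy:", categorical_acc)
```

If these assumptions hold for the notebook, the usual remedy would be `categorical_crossentropy` as the loss and float inputs fed to a network without an `Embedding` layer, since topic weights are not token indices.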

## Test data validation

In [42]:
x_tfidf = cluster_vectorizer.transform(X_test).toarray()
x_test = lda.transform(x_tfidf)
y_test = Y_test

In [43]:
y_predict = kmeans_lda.predict(x_test)
print(silhouette_score(x_test, y_predict))
pd.crosstab(y_test, y_predict)

0.5774591681508521


Author,0,1,2,3,4,5,6,7,8,9
Austen,0,0,0,0,0,0,76,0,0,0
Blake,0,0,0,0,0,0,3,1,1,0
Bryant,0,0,1,0,0,3,69,0,1,2
Burgess,0,0,0,0,0,0,22,0,0,0
Carroll,0,0,0,0,0,0,22,0,0,0
Chesterton,0,1,3,0,0,1,98,5,0,1
Edgeworth,0,0,0,1,1,1,199,0,0,1
Melville,4,4,3,5,8,5,228,6,2,12
Milton,1,0,1,1,0,0,51,0,0,1
Shakespeare,2,0,0,4,0,0,39,3,0,4


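The silhouette score on the test data (about 0.58) is close to the training score, but nearly every test sentence falls into cluster 6, so the clusters do not separate the authors. Note also that, if the cells ran in the order shown, `lda` was refit in cell 36, so the topics passed to `kmeans_lda` here do not come from the LDA model it was trained on.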

In [44]:
X_test_tfidf = tfidf_vectorizer.transform(X_test)
X_test_count = count_vectorizer.transform(X_test)

#### Random Forest Test Score

In [45]:
print("Tfidf Test score: ", rfc_tfidf.score(X_test_tfidf, Y_test))
print("Count Test score: ", rfc_count.score(X_test_count, Y_test))

Tfidf Test score:  0.31438127090301005
Count Test score:  0.3166109253065775


#### Gradient Boosting Test Score

In [46]:
print("Tfidf Test score: ", gbc_tfidf.score(X_test_tfidf, Y_test))
print("Count Test score: ", gbc_count.score(X_test_count, Y_test))

Tfidf Test score:  0.6744704570791528
Count Test score:  0.6867335562987736


#### Logistic Regression Test Score

In [47]:
print("Tfidf Test score: ", lr_tfidf.score(X_test_tfidf, Y_test))
print("Count Test score: ", lr_count.score(X_test_count, Y_test))

Tfidf Test score:  0.6744704570791528
Count Test score:  0.7302118171683389
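
With all train and test scores available, the gap between them gives a quick read on overfitting. The sketch below copies the accuracies printed in the cells above into a dictionary and computes the gap for each model; the dictionary layout and the names are chosen here for illustration.

```python
# (train accuracy, test accuracy) as printed by the notebook cells above.
scores = {
    "RandomForest + Tfidf": (0.30992196209587514, 0.31438127090301005),
    "RandomForest + Count": (0.3110367892976589, 0.3166109253065775),
    "GradientBoosting + Tfidf": (0.9126718691936083, 0.6744704570791528),
    "GradientBoosting + Count": (0.872166480862133, 0.6867335562987736),
    "LogisticRegression + Tfidf": (0.7959866220735786, 0.6744704570791528),
    "LogisticRegression + Count": (0.9609810479375697, 0.7302118171683389),
}

# A large positive gap means the model fits the training set much better
# than it generalises to unseen sentences.
gaps = {name: train - test for name, (train, test) in scores.items()}

# Model with the highest test accuracy.
best = max(scores, key=lambda name: scores[name][1])

for name, (train, test) in scores.items():
    print(f"{name:28s} train={train:.3f} test={test:.3f} gap={gaps[name]:+.3f}")
print("best on test:", best)
```

The random forest barely fits even the training data, while the other two models fit it far better than they generalise.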


# Conclusion

In conclusion, the silhouette score for KMeans on the LDA features is consistent between the train and test data (about 0.55 and 0.58), although the crosstabs show that the clusters do not correspond to authors. Logistic Regression with CountVectorizer is the best supervised model, with about 96% accuracy on the training set and 73% on the test set.