<h1><center>Blog Author Identification</h1>

For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be either a series of novels, chapters, or articles. Anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic of from the same branch of literature - something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Using those features then build models to attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.

Lastly return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is it's performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write up, and remember to include visuals to help tell your story!

## Data Preparation

#### Load libraries

In [1]:
import pandas as pd
import scipy
import numpy as np
import math
import datetime

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

#### Load CSV 

In [2]:
df0 = pd.read_csv('../data/blogtext.csv', delimiter=',',usecols=['id','topic','date','text'])
df0.dataframeName = 'blogtext.csv'
nRow, nCol = df0.shape
print(f'There are {nRow} total blog entries.')

There are 681284 total blog entries.


#### Data Cleaning

Drop duplicates

In [3]:
df0.drop_duplicates(subset="text",inplace=True)

Remove blogs using non-English months in their date fields.

In [4]:
date_strings = df0['date'].str.title()
english_month = (date_strings.str.contains(',January,')) | (date_strings.str.contains(',February,')) | \
               (date_strings.str.contains(',March,')) | (date_strings.str.contains(',April,')) | \
               (date_strings.str.contains(',May,')) | (date_strings.str.contains(',June,')) | \
               (date_strings.str.contains(',July,')) | (date_strings.str.contains(',August,')) | \
               (date_strings.str.contains(',September,')) | (date_strings.str.contains(',October,')) | \
               (date_strings.str.contains(',November,')) | (date_strings.str.contains(',December,'))
df = df0.drop(df0[~english_month].index)

In [5]:
df['datetime'] = pd.to_datetime(df['date'].str.title(), format='%d,%B,%Y')
df = df.drop(columns=['date'])
nRow, nCol = df.shape
print(f'After filtering for English months, there are {nRow} remaining blog entries.')

After filtering for English months, there are 604518 remaining blog entries.


#### Filter blog entries

Count number of posts by author and topic

In [6]:
post_count = df.groupby('id').size()
df['post_count'] = [ post_count[id] for id in df['id'] ]

In [7]:
topic_count = df.groupby('topic').size()
df['topic_count'] = [ topic_count[topic] for topic in df['topic'] ]

The most heavily posted topic is 'Student' so focus on those posts.

In [8]:
df = df[df['topic'] == 'Student']
df.groupby('id').size()

id
11762        20
48923       126
71075         3
105748       53
106160       18
114645      213
216553       34
216833      221
278367        4
298924      179
420079       72
424159       60
445263       45
452696       21
466858       25
472101       58
477017       91
479019     1001
491348       26
495541      306
498684       69
514171        4
514483       22
522409       82
531918       23
534904      113
539545        8
550590      180
567598       20
577518       30
           ... 
4315396       2
4316701       3
4316835       1
4317094       3
4317294       3
4317829       4
4317864       3
4317978       3
4318162       3
4319396       4
4319763       2
4320317       1
4320557       7
4320585       2
4321212       3
4321513       2
4322723       4
4323050       2
4323690       2
4325889      15
4326228       8
4329299       2
4330295       2
4330562       2
4330772       3
4331320       6
4333070       1
4335412       2
4336267       6
4337133       2
Length: 5003, dtype: 

Select only authors with at least 500 posts.

In [9]:
df = df[df['post_count'] >= 500]

In [10]:
df.groupby('id').size()

id
479019     1001
642604      545
780903     1337
925742      542
944569     1250
955372     1749
988941     1534
1000866     770
1119650     730
1157144    1067
1784456    1829
1889734    1013
1999563     623
2297959     665
3667467     646
dtype: int64

In [11]:
len(df)

15301

There are 15 remaining authors with a total of 15,301 posts.

#### Clean up DataFrame

In [12]:
df = df.drop(columns = ['topic','datetime','post_count','topic_count'])
df = df.reset_index(drop=True)
#df.to_csv('reduced_blog_corpus.csv')

del df0

In [13]:
df.head(5)

Unnamed: 0,id,text
0,988941,"No one has joined yet, because I've..."
1,988941,"This is a group blog for, er, a gro..."
2,988941,About the urlLink snooboo.com Foru...
3,988941,No the goverment won't increase pay...
4,988941,"Aaaaaahhhhh, parents evening. My da..."


#### Create training/test set split

In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(df['text'],df['id'], test_size=0.25)

## Data Processing

#### Process posts with spaCy

In [15]:
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Get spaCy stopwords
nlp = spacy.load('c:/users/yfsta/anaconda3/lib/site-packages/en_core_web_lg/en_core_web_lg-2.1.0')
stop_words = list(spacy.lang.en.stop_words.STOP_WORDS)

In [16]:
# Initialize English parser
parser = English()

# Create spaCy tokenizer
def spacy_tokenizer(sentence):
    
    # Create tokens object from parser
    tokens = nlp(sentence)
#    tokens = parser(sentence)

    # Remove words not included in NLP vocab list (oov = out of vocabulary)
    tokens = [ t for t in tokens if t.is_oov is False ]
    
    # Exclusion list
    exclusion_list = list(string.punctuation)
    exclusion_list.extend(list(stop_words))
    
    # Lemmatize, convert to lowercase and remove extra spaces
    tokens = [ w.lemma_.lower().strip() for w in tokens if w.lemma_ not in exclusion_list ]
    
    return tokens

In [17]:
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Get spaCy stopwords
nlp = spacy.load('c:/users/yfsta/anaconda3/lib/site-packages/en_core_web_lg/en_core_web_lg-2.1.0')

#### Bag of Words Vectorization

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

In [19]:
bow_train_out = bow_vector.fit_transform(X_train)
bow_test_out = bow_vector.fit_transform(X_test)

In [20]:
print('The dimensions of the BoW training set are: ' + str(bow_train_out.shape))
print('The dimensions of the BoW test set are: ' + str(bow_test_out.shape))

The dimensions of the BoW training set are: (11475, 31178)
The dimensions of the BoW test set are: (3826, 19218)


In [21]:
from sklearn.preprocessing import normalize

bow_train_out = normalize(bow_train_out)
bow_test_out = normalize(bow_test_out)

#### tf-idf Vectorization

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [23]:
tfidf_train_out = tfidf_vector.fit_transform(X_train)
tfidf_test_out = tfidf_vector.fit_transform(X_test)

In [24]:
print('The dimensions of the tf-idf training set are: ' + str(tfidf_train_out.shape))
print('The dimensions of the tf-idf test set are: ' + str(tfidf_test_out.shape))

The dimensions of the tf-idf training set are: (11475, 31178)
The dimensions of the tf-idf test set are: (3826, 19218)


In [25]:
tfidf_train_out = normalize(tfidf_train_out)
tfidf_test_out = normalize(tfidf_test_out)

#### Reduce dimensionality

In [26]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [27]:
bow_svd = TruncatedSVD(300)
bow_train_lsa = bow_svd.fit_transform(bow_train_out)

variance_explained = bow_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance of BoW training set captured by all components:",total_variance*100)

Percent variance of BoW training set captured by all components: 64.91124307080909


In [28]:
bow_svd = TruncatedSVD(200)
bow_test_lsa = bow_svd.fit_transform(bow_test_out)

variance_explained=bow_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance of BoW test captured by all components:",total_variance*100)

Percent variance of BoW test captured by all components: 61.73356435235304


In [29]:
tfidf_svd= TruncatedSVD(1200)
tfidf_train_lsa = tfidf_svd.fit_transform(tfidf_train_out)

variance_explained=tfidf_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance tf-idf training captured by all components:",total_variance*100)

Percent variance tf-idf training captured by all components: 59.20365463323159


In [30]:
tfidf_svd= TruncatedSVD(800)
tfidf_test_lsa = tfidf_svd.fit_transform(tfidf_test_out)

variance_explained=tfidf_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance tf-idf test captured by all components:",total_variance*100)

Percent variance tf-idf test captured by all components: 60.41813557655089


## Clustering

#### K-Means

In [31]:
from sklearn.cluster import KMeans

# Initialize and fit K-Means
bow_full_pred = KMeans(n_clusters=15, random_state=42).fit_predict(bow_train_out)

# Check actual vs. predicted
pd.crosstab(Y_train, bow_full_pred) 

col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
479019,21,76,18,31,105,79,0,109,4,2,34,0,129,81,80
642604,14,115,15,24,58,30,0,54,2,0,14,0,29,44,31
780903,40,47,47,26,310,3,0,269,1,0,4,0,15,126,108
925742,5,73,16,9,36,8,21,88,0,4,1,0,3,101,12
944569,45,136,46,31,82,79,0,136,3,0,22,83,47,174,49
955372,146,268,119,134,47,56,0,117,12,26,15,2,34,216,129
988941,165,131,97,209,62,18,0,79,6,0,1,0,15,98,239
1000866,46,16,12,190,23,13,0,17,3,0,72,0,35,19,123
1119650,3,104,8,16,125,7,0,107,2,0,2,0,0,124,40
1157144,53,107,65,44,71,66,0,101,6,1,43,3,77,84,100


In [32]:
from sklearn.cluster import MiniBatchKMeans

# Initialize and fit Mini-Batch K-Means
minibatchkmeans = MiniBatchKMeans(init='random', n_clusters=15, batch_size=1000)
minibatchkmeans.fit(bow_train_out)

predict_mini = minibatchkmeans.predict(bow_train_out)

# Check the Mini-Batch model against our earlier one
print('Comparing K-Means and Mini-Batch K-Means solutions:')
print(pd.crosstab(predict_mini, bow_full_pred))

Comparing K-Means and Mini-Batch K-Means solutions:
col_0   0    1    2     3     4    5   6     7   8   9    10  11   12   13   14
row_0                                                                          
0        0    0    0  1244     0    0   1     0   0   2    3   0    0    0  329
1       43    0    0     0   248    0   2  1533   8   0    0   0   39  886    0
2        0  734  216     0     0  121   1     0  19   2    0   0    0  576    0
3      364    0  381     0     0   34  15    33  20  24    0   0  173   19    0
4        0    0    0     0     0    0   0     0   0   0    1  99    7    0    1
5        2   14   28     0     0    8   0     2   1   0    0   1    1   19    0
6        0  753    0     0     0    1   0     0   1   0    0   0    0   17    0
7        0    0    1     0     0    0   0     0   0   0    0   0    0    0    0
8        0    0    0     0     0    0   0     0   0   0    0   0    0    0    1
9        0    0    1     0     0    0   0     0   0   0    0   0    

Mini-Batch runs much faster than a full K-Means and for most bloggers, the majority of results are consistent. However, there are some exceptions that are split fairly evenly between a couple of buckets.

In [33]:
from sklearn import metrics
    
bow_kmeans_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
bow_kmeans_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), bow_full_pred, metric='euclidean')

print('ARI for K-Means BoW is: ' + str(bow_kmeans_ari))
print('Silhouette score for K-Means BoW is: ' + str(bow_kmeans_sil))

ARI for K-Means BoW is: 0.0316406610846862
Silhouette score for K-Means BoW is: -0.5963023830976139


K-Means BoW appears only slightly better than random.

In [34]:
# Initialize and fit K-Means
tfidf_full_pred = KMeans(n_clusters=15, random_state=42).fit_predict(tfidf_train_out)

# Check actual vs predicted
pd.crosstab(Y_train, tfidf_full_pred) 

col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
479019,116,0,15,2,0,6,107,94,142,1,0,2,19,54,211
642604,88,4,1,3,9,26,119,48,31,4,0,0,1,50,46
780903,206,3,10,0,5,1,96,161,75,6,368,0,8,43,14
925742,88,1,0,8,7,0,119,106,21,3,2,2,0,14,6
944569,191,10,1,14,2,14,199,147,56,45,0,140,15,28,71
955372,444,21,4,28,0,16,343,103,192,33,0,11,4,65,57
988941,361,2,5,31,20,10,150,33,342,61,0,0,11,80,14
1000866,60,1,0,11,31,0,20,12,137,1,0,0,22,217,57
1119650,73,13,2,5,0,1,173,152,48,15,0,0,0,51,5
1157144,249,5,2,14,4,4,154,79,142,2,0,17,5,31,113


In [35]:
# Initialize and fit Mini-Batch K-Means
minibatchkmeans = MiniBatchKMeans(init='random', n_clusters=15, batch_size=1000)
minibatchkmeans.fit(tfidf_train_out)

predict_mini = minibatchkmeans.predict(tfidf_train_out)

print('Comparing K-Means and Mini-Batch K-Means solutions:')
print(pd.crosstab(predict_mini, tfidf_full_pred))

Comparing K-Means and Mini-Batch K-Means solutions:
col_0    0   1    2   3   4    5     6    7     8    9    10   11  12   13   14
row_0                                                                          
0         0   0    0   0   0    0     0    0     1    0    0    0   0    0    0
1        38   0    3   3   5    0     1    9    32  207    0    0   6    0   10
2        95   7   21  44  35  101  1956  314     0    4    0    0  27    0  121
3        36   0    1   1   0    0    30    4     3    0    0    0   3    0   10
4      2304   2  183  60  61    0     3  184   596    0    0    0  42    0  409
5         0   0    0   0   0    0     0    0     0    0    1    0   0    0    0
6        48   0    0   2  13    0    39   35    16   14    2    0   6    0    9
7         0  69    0   5   1    0     0    1     4    0    0    0   0    1    2
8        18   0    0   0   0    0     8    6     4    0    0    0   1    0    4
9        24   1    0   4   6    0     4    7    10    2    0  228   

Mini-Batch runs much faster than a full K-Means but is does not have consistent results for many of the bloggers.

In [36]:
tfidf_kmeans_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
tfidf_kmeans_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), tfidf_full_pred, metric='euclidean')

print('ARI for K-Means tf-idf is: ' + str(tfidf_kmeans_ari))
print('Silhouette score for K-Means tf-idf is: ' + str(tfidf_kmeans_sil))

ARI for K-Means tf-idf is: 0.04224020946527432
Silhouette score for K-Means tf-idf is: -0.49868350058777094


K-Means tf-idf does not perform much better than BoW.

#### Mean-Shift

In [37]:
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth based on data
bandwidth = estimate_bandwidth(bow_train_out.toarray(), quantile=0.5, n_samples=500, n_jobs=-1)

# Initialize and fit model
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, min_bin_freq=10, n_jobs=3)
ms.fit(bow_train_out.toarray())

# Extract cluster assignments for each data point.
labels = ms.labels_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated Mean-Shift BoW clusters: {}".format(n_clusters_))

Number of estimated Mean-Shift BoW clusters: 1


In [38]:
bow_full_pred = ms.predict(bow_train_out.toarray())

In [40]:
# Check actual vs predicted
pd.crosstab(Y_train, bow_full_pred)

bow_mean_shift_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
bow_mean_shift_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), bow_full_pred, metric='euclidean')

print('ARI for Mean-Shift BoW is: ' + str(bow_kmeans_ari))
print('Silhouette score for Mean-Shift BoW is: ' + str(bow_mean_shift_sil))

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

In [None]:
# Estimate bandwidth based on data
bandwidth = estimate_bandwidth(tfidf_train_out.toarray(), quantile=0.2, n_samples=500, n_jobs=3)

# Initialize and fit model
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, min_bin_freq=10, n_jobs=3)
ms.fit(tfidf_train_out.toarray())

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated tf-idf Mean-Shift clusters: {}".format(n_clusters_))

In [None]:
tfidf_full_pred = ms.predict(tfidf_train_out.toarray())

In [None]:
# Check actual vs predicted
pd.crosstab(Y_train, tfidf_full_pred)

tfidf_mean_shift_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
tfidf_mean_shift_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), tfidf_full_pred, metric='euclidean')

print('ARI for Mean-Shift tf-idf is: ' + str(tfidf_kmeans_ari))
print('Silhouette score for Mean-Shift tf-idf is: ' + str(tfidf_kmeans_sil))

#### Spectral Clustering

In [41]:
from sklearn.cluster import SpectralClustering

# Declare and fit the model
sc = SpectralClustering(n_clusters=15)
sc.fit(bow_train_out)

#Predict clusters
bow_full_pred = sc.fit_predict(bow_train_out)

In [42]:
# Check actual vs predicted
print(pd.crosstab(Y_train,bow_full_pred))

bow_sc_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
bow_sc_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), bow_full_pred, metric='euclidean')

print('ARI for Spectral Clustering BoW is: ' + str(bow_sc_ari))
print('Silhouette score for Spectral Clustering BoW is: ' + str(bow_sc_sil))

col_0     0    1   2   3   4    5    6   7    8    9    10   11  12  13   14
id                                                                          
479019    45   11  47  86   1   57  196  18   79   35    0   83  38  32   41
642604   101   32   8  21   0   48  116  26    8   10    0   16  12  16   16
780903    36   56  19   6   0   93  402  24  276   15    0   24   7  33    5
925742    56   35   7   6   4   19  165  10   15   17    1    7  14  20    1
944569   106  109  30  54   0   51  292  54   10   26  109    6  23  36   27
955372   207  257  42  48  22  116  260  39   37  136    8   38  50  47   14
988941   103  173  60  15   0  276  149  35   48   37    0  100  74  47    3
1000866   10   34  65   9   0  165   38  43   16   38    0   18  41  20   72
1119650   97   21  13   6   0   62  270   7    9   16    0   13   9  13    2
1157144   75  108  52  60   1   81  198  59   24   29    4   22  25  31   52
1784456   78  154  79  41   0  255  215  39   50   42    0   27  32  35  344

In [43]:
# Declare and fit the model
sc = SpectralClustering(n_clusters=15)
sc.fit(tfidf_train_out)

#Predict clusters
tfidf_full_pred = sc.fit_predict(tfidf_train_out)

In [44]:
# Check actual vs predicted
print(pd.crosstab(Y_train,tfidf_full_pred))

tfidf_sc_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
tfidf_sc_sil = metrics.silhouette_score(Y_train.values.reshape(-1,1), tfidf_full_pred, metric='euclidean')

print('ARI for Spectral Clustering tf-idf is: ' + str(tfidf_sc_ari))
print('Silhouette score for Spectral Clustering tf-idf is: ' + str(tfidf_sc_sil))

col_0     0   1    2   3   4   5    6    7   8    9    10  11   12  13   14
id                                                                         
479019   124   0   46   1   0   0   58    2   0    2  168   1  285   3   79
642604   190   0   26   0   0   4   86    0   0    0   47  20   24   5   28
780903   273   0   52   0   0   3  155    0   0  427   13   0   47   9   17
925742   136   0   16   4   0   1  101    1  21    4    9   0   49   4   31
944569   338   0   56   0   5  11  188  129   0    0   80   7   23  60   36
955372   625   0  163  27   3  21  151   11   0    3   60   7  102  40  108
988941   370   0  280   0   4   2   41    0   0    0   17   6  300  69   31
1000866   60   0   91   0   0   2   18    0   0    0   46   0  141   3  208
1119650  204   0   45   0   0  13  215    0   0    0    6   0   28  16   11
1157144  333   0  117   2   4   4   76   13   0    1  120   3   72   2   74
1784456  337   0  532   0   1  11   84    0   0    1  246   9   55  45   70
1889734  490

## Unsupervised Feature Generation

## Model Comparison

## Conclusions