<h1><center>Blog Author Identification</center></h1>

For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles; anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature), so that classification is harder than it would be with obviously different subjects.

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then use those features to build models that classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

## Data Preparation

#### Load libraries

In [None]:
import pandas as pd
import scipy
import numpy as np
import math
import datetime

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

#### Load CSV 

In [None]:
df0 = pd.read_csv('../data/blogtext.csv', delimiter=',',usecols=['id','topic','date','text'])
df0.dataframeName = 'blogtext.csv'
nRow, nCol = df0.shape
print(f'There are {nRow} total blog entries.')

#### Data Cleaning

Drop duplicates

In [None]:
df0.drop_duplicates(subset="text",inplace=True)

Remove blogs using non-English months in their date fields.

In [None]:
# English month names as they appear between the commas of 'day,Month,year'
english_months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
                  'August', 'September', 'October', 'November', 'December']
month_pattern = ',(?:' + '|'.join(english_months) + '),'

# Keep only rows whose date contains one of the English month names
date_strings = df0['date'].str.title()
english_month = date_strings.str.contains(month_pattern, regex=True, na=False)
df = df0[english_month].copy()

In [None]:
df['datetime'] = pd.to_datetime(df['date'].str.title(), format='%d,%B,%Y')
df = df.drop(columns=['date'])
nRow, nCol = df.shape
print(f'After filtering for English months, there are {nRow} remaining blog entries.')
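
As a cross-check on the month filter, a similar effect can be sketched with `pd.to_datetime(..., errors='coerce')`, which turns any string that does not match the format into `NaT`. The three date strings below are made up for illustration, and `%B` depends on the active locale, so this sketch assumes an English or C locale.

```python
import pandas as pd

# Hypothetical date strings in the corpus's 'day,Month,year' layout
sample_dates = pd.Series(['14,May,2004', '13,mai,2004', '02,AUGUST,2004'])

# Title-case so 'AUGUST' matches '%B'; months that cannot be parsed become NaT
parsed = pd.to_datetime(sample_dates.str.title(), format='%d,%B,%Y', errors='coerce')

# Keep only the rows whose date parsed successfully
english_only = sample_dates[parsed.notna()]
print(english_only.tolist())
```

This approach also drops rows whose day or year is malformed, which the substring filter would let through to the parsing step.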

#### Filter blog entries

Count number of posts by author and topic

In [None]:
post_count = df.groupby('id').size()
df['post_count'] = df['id'].map(post_count)

In [None]:
topic_count = df.groupby('topic').size()
df['topic_count'] = df['topic'].map(topic_count)

The most heavily posted topic is 'Student' so focus on those posts.

In [None]:
df = df[df['topic'] == 'Student']
df.groupby('id').size()

Select only authors with at least 500 posts.

In [None]:
df = df[df['post_count'] >= 500]

In [None]:
df.groupby('id').size()

In [None]:
len(df)

There are 15 remaining authors with a total of 15,301 posts.

#### Clean up DataFrame

In [None]:
df = df.drop(columns = ['topic','datetime','post_count','topic_count'])
#df.to_csv('reduced_blog_corpus.csv')

del df0

In [None]:
df.head(5)

#### Create training/test set split

In [None]:
# Fix the seed so the 75/25 split is reproducible between runs
X_train, X_test, Y_train, Y_test = train_test_split(df['text'], df['id'], test_size=0.25, random_state=42)
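
With 15 authors, a purely random split can leave some authors over- or under-represented in the holdout set. One option, shown here as a sketch on a made-up toy corpus rather than the blog data, is the `stratify` argument of `train_test_split`, which keeps each author's share the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 3 authors with 8 posts each
toy = pd.DataFrame({
    'id': [1] * 8 + [2] * 8 + [3] * 8,
    'text': [f'post {i}' for i in range(24)],
})

# stratify keeps each author's proportion equal in train and test;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['text'], toy['id'], test_size=0.25, stratify=toy['id'], random_state=42
)

print(y_te.value_counts().sort_index().to_dict())
```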

## Data Processing

#### Process posts with spaCy

In [1]:
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Get spaCy stopwords
nlp = spacy.load('en')
stop_words = list(spacy.lang.en.stop_words.STOP_WORDS)

# Initialize English parser
parser = English()

# Create spaCy tokenizer
def spacy_tokenizer(sentence):
    
    # Create tokens object from parser
    tokens = parser(sentence)

    # Remove words not included in NLP vocab list (oov = out of vocabulary)
    tokens = [ t for t in tokens if t.is_oov is False ]
    
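    # Exclusion set (sets give constant-time membership tests)
    exclusion_list = set(string.punctuation) | set(stop_words)
    
    # Lemmatize, convert to lowercase and strip surrounding whitespace
    lemmas = [ w.lemma_.strip().lower() for w in tokens ]

    # Drop punctuation, stop words and empty strings left over from stripping
    tokens = [ lemma for lemma in lemmas if lemma and lemma not in exclusion_list ]
    return tokens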

ValueError: could not broadcast input array from shape (96) into shape (128)

In [1]:
import spacy

nlp = spacy.load('en')

doc = nlp('I am sflmgmavknsaccasas dog cat bird bulbasaur')

for token in doc:
    # Printing the following attributes of each token. 
    # text: the word string, has_vector: if it contains 
    # a vector representation in the model,  
    # vector_norm: the algebraic norm of the vector, 
    # is_oov: if the word is out of vocabulary. 
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) 

ValueError: could not broadcast input array from shape (96) into shape (128)

#### Bag of Words Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

In [None]:
bow_train_out = bow_vector.fit_transform(X_train)
# Reuse the vocabulary learned from the training set; do not refit on the test set
bow_test_out = bow_vector.transform(X_test)

In [None]:
print('The dimensions of the BoW training set are: ' + str(bow_train_out.shape))
print('The dimensions of the BoW test set are: ' + str(bow_test_out.shape))
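
The training and test matrices are only comparable when they share one vocabulary, which means the vectorizer learns its vocabulary from the training texts and is merely applied to the test texts. A minimal sketch on two made-up mini corpora (the sentences are illustrative and the default tokenizer stands in for `spacy_tokenizer`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature corpora
train_texts = ['the cat sat', 'the dog barked', 'a cat and a dog']
test_texts = ['the bird sang', 'a dog sat']

vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts only
train_matrix = vectorizer.fit_transform(train_texts)

# Apply that vocabulary to the test texts; unseen words ('bird', 'sang') are ignored
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
```

Both matrices end up with the same number of columns, and column `j` refers to the same word in each.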

In [None]:
from sklearn.preprocessing import normalize

bow_train_out = normalize(bow_train_out)
bow_test_out = normalize(bow_test_out)
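
`normalize` defaults to the L2 norm, so every row is rescaled to unit Euclidean length and a long post no longer dominates a short one that uses the same words in the same proportions. A small sketch with made-up counts:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Hypothetical word counts: the second "post" is the first one repeated three times
counts = csr_matrix(np.array([[1, 2, 2], [3, 6, 6]], dtype=float))

# Default norm='l2' rescales each row to unit Euclidean length
unit_rows = normalize(counts)

row_lengths = np.linalg.norm(unit_rows.toarray(), axis=1)
print(unit_rows.toarray())
print(row_lengths)
```

After normalization the two rows are identical, which is the behaviour wanted when post length should not influence the clustering.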

#### tf-idf Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [None]:
tfidf_train_out = tfidf_vector.fit_transform(X_train)
# Reuse the vocabulary and idf weights learned from the training set
tfidf_test_out = tfidf_vector.transform(X_test)

In [None]:
print('The dimensions of the tf-idf training set are: ' + str(tfidf_train_out.shape))
print('The dimensions of the tf-idf test set are: ' + str(tfidf_test_out.shape))

In [None]:
tfidf_train_out = normalize(tfidf_train_out)
tfidf_test_out = normalize(tfidf_test_out)

#### Reduce dimensionality

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

In [None]:
bow_svd = TruncatedSVD(1000)
bow_train_lsa = bow_svd.fit_transform(bow_train_out)

variance_explained = bow_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance of BoW training set captured by all components:",total_variance*100)

In [None]:
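# Caution: this fits a second, independent SVD on the test matrix, so its
# components do not correspond to those learned from the training set
bow_svd = TruncatedSVD(500)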
bow_test_lsa = bow_svd.fit_transform(bow_test_out)

variance_explained=bow_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance of BoW test captured by all components:",total_variance*100)

In [None]:
tfidf_svd= TruncatedSVD(1000)
tfidf_train_lsa = tfidf_svd.fit_transform(tfidf_train_out)

variance_explained=tfidf_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance tf-idf training captured by all components:",total_variance*100)

In [None]:
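# Caution: this fits a second, independent SVD on the test matrix, so its
# components do not correspond to those learned from the training set
tfidf_svd = TruncatedSVD(500)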
tfidf_test_lsa = tfidf_svd.fit_transform(tfidf_test_out)

variance_explained=tfidf_svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance tf-idf test captured by all components:",total_variance*100)
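
For the reduced features to mean the same thing in both sets, the vectorizer and the SVD are usually fitted once on the training data and then only applied to the test data. One way to keep those steps together is `make_pipeline`. The sketch below uses made-up sentences, two components instead of hundreds, and the default tokenizer, so it illustrates the pattern rather than the notebook's actual settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical miniature corpora
train_texts = [
    'the exam was hard and the teacher was strict',
    'homework and exam week at school',
    'my band played a concert last night',
    'new guitar and a new song for the band',
    'school teacher gave more homework',
    'the concert and the song were great',
]
test_texts = ['exam and homework again', 'the band wrote a song']

# tf-idf -> SVD (LSA) -> re-normalise rows to unit length
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(copy=False),
)

# Fit every step on the training texts, then only transform the test texts
train_lsa = lsa.fit_transform(train_texts)
test_lsa = lsa.transform(test_texts)

print(train_lsa.shape, test_lsa.shape)
```

Because the test texts pass through the components learned from the training texts, both outputs live in the same two-dimensional space and can be fed to the same clustering or classification model.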

## Clustering

#### K-Means

In [None]:
from sklearn.cluster import KMeans

# Initialize and fit K-Means
bow_full_pred = KMeans(n_clusters=15, random_state=42).fit_predict(bow_train_out)

# Check actual vs. predicted
pd.crosstab(Y_train, bow_full_pred) 
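
A raw crosstab of 15 authors by 15 clusters is hard to read at a glance. One way to summarise whether authors are consistently grouped is to normalise each row and look at the share of an author's posts that fall into that author's most common cluster. The labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical author labels and cluster assignments
authors = pd.Series(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], name='author')
clusters = pd.Series([0, 0, 0, 1, 1, 1, 0, 0], name='cluster')

# Row-normalised crosstab: each author's row sums to 1
shares = pd.crosstab(authors, clusters, normalize='index')

# Share of each author's posts that land in that author's dominant cluster
dominant_share = shares.max(axis=1)
print(dominant_share)
```

A value near 1 means the author's posts sit almost entirely in one cluster; a value near `1 / n_clusters` means they are spread evenly.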

In [None]:
from sklearn.cluster import MiniBatchKMeans

# Initialize and fit Mini-Batch K-Means
minibatchkmeans = MiniBatchKMeans(init='random', n_clusters=15, batch_size=1000)
minibatchkmeans.fit(bow_train_out)

predict_mini = minibatchkmeans.predict(bow_train_out)

# Check the Mini-Batch model against our earlier one
print('Comparing K-Means and Mini-Batch K-Means solutions:')
print(pd.crosstab(predict_mini, bow_full_pred))

Mini-Batch runs much faster than a full K-Means and for most bloggers, the majority of results are consistent. However, there are some exceptions that are split fairly evenly between a couple of buckets.
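
The agreement between the two solutions can also be summarised in a single number: the adjusted Rand index between the two label vectors, which ignores how the clusters happen to be numbered. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for the BoW matrix.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (an assumption; not the blog corpus)
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

full_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
mini_labels = MiniBatchKMeans(
    n_clusters=4, batch_size=100, n_init=3, random_state=42
).fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
agreement = adjusted_rand_score(full_labels, mini_labels)
print(f'ARI between K-Means and Mini-Batch K-Means: {agreement:.3f}')
```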

In [None]:
from sklearn import metrics
    
bow_kmeans_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
bow_kmeans_sil = metrics.silhouette_score(bow_train_out, bow_full_pred, metric='euclidean')

print('ARI for K-Means BoW is: ' + str(bow_kmeans_ari))
print('Silhouette score for K-Means BoW is: ' + str(bow_kmeans_sil))

K-Means BoW appears only slightly better than random.
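
The two scores answer different questions. ARI compares the predicted clusters with known labels, so it takes two label vectors. The silhouette score needs no ground truth and measures how compact and well separated the clusters are in feature space, so it takes the feature matrix plus the predicted labels. A sketch on synthetic blobs (a stand-in for the real vectors):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in data with known group membership
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)

pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# External measure: known labels vs. predicted labels
ari = adjusted_rand_score(true_labels, pred)

# Internal measure: feature matrix vs. predicted labels
sil = silhouette_score(X, pred, metric='euclidean')

print(f'ARI: {ari:.3f}  silhouette: {sil:.3f}')
```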

In [None]:
# Initialize and fit K-Means
tfidf_full_pred = KMeans(n_clusters=15, random_state=42).fit_predict(tfidf_train_out)

# Check actual vs predicted
pd.crosstab(Y_train, tfidf_full_pred) 

In [None]:
# Initialize and fit Mini-Batch K-Means
minibatchkmeans = MiniBatchKMeans(init='random', n_clusters=15, batch_size=1000)
minibatchkmeans.fit(tfidf_train_out)

predict_mini = minibatchkmeans.predict(tfidf_train_out)

print('Comparing K-Means and Mini-Batch K-Means solutions:')
print(pd.crosstab(predict_mini, tfidf_full_pred))

Mini-Batch runs much faster than a full K-Means, but it does not produce consistent results for many of the bloggers.

In [None]:
tfidf_kmeans_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
tfidf_kmeans_sil = metrics.silhouette_score(tfidf_train_out, tfidf_full_pred, metric='euclidean')

print('ARI for K-Means tf-idf is: ' + str(tfidf_kmeans_ari))
print('Silhouette score for K-Means tf-idf is: ' + str(tfidf_kmeans_sil))

K-Means tf-idf does not perform much better than BoW.

#### Mean-Shift

In [None]:
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth based on data
bandwidth = estimate_bandwidth(bow_train_out.toarray(), quantile=0.5, n_samples=500, n_jobs=-1)

# Initialize and fit model
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, min_bin_freq=10, n_jobs=3)
ms.fit(bow_train_out.toarray())

# Extract cluster assignments for each data point.
labels = ms.labels_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated Mean-Shift BoW clusters: {}".format(n_clusters_))
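
Mean-Shift does not take a cluster count; the number of clusters it finds is driven by the bandwidth, and `estimate_bandwidth` derives that bandwidth from the `quantile` argument. The sketch below, on synthetic blobs rather than the blog vectors, loops over a few quantiles to show how the cluster count can be inspected before settling on a value.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the blog corpus)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

cluster_counts = {}
for quantile in (0.1, 0.2, 0.5):
    # A larger quantile gives a larger bandwidth, which tends to merge clusters
    bandwidth = estimate_bandwidth(X, quantile=quantile, random_state=0)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X).labels_
    cluster_counts[quantile] = len(np.unique(labels))

print(cluster_counts)
```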

In [None]:
bow_full_pred = ms.predict(bow_train_out.toarray())

In [None]:
# Check actual vs predicted
print(pd.crosstab(Y_train, bow_full_pred))

bow_mean_shift_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
bow_mean_shift_sil = metrics.silhouette_score(bow_train_out, bow_full_pred, metric='euclidean')

print('ARI for Mean-Shift BoW is: ' + str(bow_mean_shift_ari))
print('Silhouette score for Mean-Shift BoW is: ' + str(bow_mean_shift_sil))

In [None]:
# Estimate bandwidth based on data
bandwidth = estimate_bandwidth(tfidf_train_out.toarray(), quantile=0.2, n_samples=500, n_jobs=3)

# Initialize and fit model
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, min_bin_freq=10, n_jobs=3)
ms.fit(tfidf_train_out.toarray())

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated tf-idf Mean-Shift clusters: {}".format(n_clusters_))

In [None]:
tfidf_full_pred = ms.predict(tfidf_train_out.toarray())

In [None]:
# Check actual vs predicted
print(pd.crosstab(Y_train, tfidf_full_pred))

tfidf_mean_shift_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
tfidf_mean_shift_sil = metrics.silhouette_score(tfidf_train_out, tfidf_full_pred, metric='euclidean')

print('ARI for Mean-Shift tf-idf is: ' + str(tfidf_mean_shift_ari))
print('Silhouette score for Mean-Shift tf-idf is: ' + str(tfidf_mean_shift_sil))

#### Spectral Clustering

In [None]:
from sklearn.cluster import SpectralClustering

# Declare the model
sc = SpectralClustering(n_clusters=15)

# Fit and predict clusters in one step (fit_predict refits the model,
# so a separate fit call would only repeat the work)
bow_full_pred = sc.fit_predict(bow_train_out)

In [None]:
# Check actual vs predicted
print(pd.crosstab(Y_train,bow_full_pred))

bow_sc_ari = metrics.adjusted_rand_score(Y_train, bow_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
bow_sc_sil = metrics.silhouette_score(bow_train_out, bow_full_pred, metric='euclidean')

print('ARI for Spectral Clustering BoW is: ' + str(bow_sc_ari))
print('Silhouette score for Spectral Clustering BoW is: ' + str(bow_sc_sil))

In [None]:
# Declare the model
sc = SpectralClustering(n_clusters=15)

# Fit and predict clusters in one step (fit_predict refits the model,
# so a separate fit call would only repeat the work)
tfidf_full_pred = sc.fit_predict(tfidf_train_out)

In [None]:
# Check actual vs predicted
print(pd.crosstab(Y_train,tfidf_full_pred))

tfidf_sc_ari = metrics.adjusted_rand_score(Y_train, tfidf_full_pred)
# Silhouette is computed on the feature matrix, not on the author labels
tfidf_sc_sil = metrics.silhouette_score(tfidf_train_out, tfidf_full_pred, metric='euclidean')

print('ARI for Spectral Clustering tf-idf is: ' + str(tfidf_sc_ari))
print('Silhouette score for Spectral Clustering tf-idf is: ' + str(tfidf_sc_sil))
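
`SpectralClustering` builds its similarity graph with an RBF kernel by default. For text vectors, a cosine-similarity matrix passed with `affinity='precomputed'` is a possible alternative worth comparing. The sketch below uses six made-up documents and is an illustration of that option, not a result from the blog corpus.

```python
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents on two loose themes (one shared word keeps the graph connected)
docs = [
    'homework exam school teacher',
    'school exam class homework',
    'teacher class exam school',
    'guitar band concert music',
    'music concert band school',
    'song guitar music band',
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities in [0, 1] serve as the affinity matrix
affinity = cosine_similarity(tfidf)

labels = SpectralClustering(
    n_clusters=2, affinity='precomputed', random_state=42
).fit_predict(affinity)

print(labels)
```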

## Unsupervised Feature Generation

## Model Comparison

## Conclusions