# Document Clustering: An Evaluation of Feature Matricies

- Wilfrid Laurier University (Winter 2018)
- CS640 - Introduction to Machine Learning
- Ryan Kazmerik (175826410)

## Overview
{ Still need to write this section, providing a brief overview of the experiment }

## Datasets
### 1. 20newsgroups
The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics and has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Let's import the 20newsgroups dataset:

In [65]:
from sklearn.datasets import fetch_20newsgroups

# ds1 = 20-newsgroups dataset
ds1 = fetch_20newsgroups(subset='all', categories=None, shuffle=True,  
                         random_state=1, remove=('footers','quotes'))

# Extract the target classes
ds1_labels = ds1.target

print '20-newsgroups dataset:'
print '   Documents=', len(ds1.data)
print '   Categories=', len(ds1.target_names)

20-newsgroups dataset:
   Documents= 18846
   Categories= 20


### 2. Reuters-21578
Reuters-21578 is commonly used collection for text clustering and  classification as it contains structured information about newswire articles that can be assigned to several classes, making it a multi-label problem.

Let's import the reuters-21578 dataset:

In [74]:
import nltk
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.corpus import reuters

docs = reuters.fileids()

# ds2 = Reuters-21578 dataset
ds2_ids = list(docs)

ds2 = [reuters.raw(doc_id) for doc_id in ds2_ids]

# Extract the target classes
mlb = MultiLabelBinarizer()
ds2_labels = mlb.fit_transform([reuters.categories(doc_id)
                                  for doc_id in ds2_ids])
 
print 'Reuters-21578 dataset:'
print '   Documents=', len(ds2)
print '   Categories=', len(reuters.categories())

Reuters-21578 dataset:
   Documents= 10788
   Categories= 90


## Vectorizing Features
The text in the documents must be parsed to remove stop words (tokenization) and the words need to be encoded as floating point values to be used as input for our clustering algorithm (vectorization).

Let's create some feature vectors for our datasets:

In [69]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Use TF-IDF vectorizer for Kmeans & PLSA
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, 
                     max_features=10, stop_words='english')

# Use TF (term count) vectorizer for LDA
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=10, stop_words='english')

# Generate feature sets for fs1 (20newsgroups) and fs2 (reuters)
fs1_idf = tfidf_vectorizer.fit_transform(ds1.data)
fs1_tf = tf_vectorizer.fit_transform(ds1.data)

fs2_idf = tfidf_vectorizer.fit_transform(ds2)
fs2_tf = tf_vectorizer.fit_transform(ds2)

print '20-newsgroups features:'
print '   Num features:', fs1_idf.shape[1]
print '   Non-zero components:', fs1_tf.nnz / float(fs1_tf.shape[0])
print ''
print 'Reuters-21578 features:'
print '   Num features:', fs2_idf.shape[1]
print '   TF Non-zero components:', fs2_tf.nnz / float(fs2_tf.shape[0])

20-newsgroups features:
   Num features: 10
   Non-zero components: 3.28207577205

Reuters-21578 features:
   Num features: 10
   TF Non-zero components: 3.77975528365


## Clustering (Kmeans, LDA, PLSA)
Three algorithms are used to cluster the results including a standard implementation of Kmeans, Non-negative Matrix Factorization is applied with the generalized Kullback-Leibler divergence which is equivalent to Probabilistic Latent Semantic Analysis (PLSA).

Let's cluster our feature sets:

In [75]:
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation, NMF
from time import time

# Fit the kMeans model KM1=20-newsgroups KM2=Reuters-21578
t0 = time()
KM = KMeans(n_clusters=4, init='k-means++', max_iter=100, 
            n_init=1, verbose=0)
print'Fitting the kMeans model...'

KM1 = KM.fit(fs1_idf)
KM2 = KM.fit(fs2_idf)

print'   done in:', (time()-t0)
print ''

# Fit the LDA model LDA1=20-newsgroups LDA2=Reuters-21578
t0 = time()
LDA = LatentDirichletAllocation(n_components=10, max_iter=5,
            learning_method='online', learning_offset=50., random_state=0)
print 'Fitting the LDA model...'
LDA1 = LDA.fit(fs1_tf)
LDA2 = LDA.fit(fs2_tf)

print'   done in:', (time()-t0)
print ''

# Fit the PLSA model PLSA1=20-newsgroups PLSA2=Reuters-21578
t0 = time()
PLSA = NMF(n_components=10, random_state=1, beta_loss='kullback-leibler', 
           solver='mu', max_iter=1000, alpha=.1, l1_ratio=.5)
print 'Fitting the PLSA model...'

PLSA2 = PLSA.fit(fs1_idf)
PLSA2 = PLSA.fit(fs2_idf)

print'   done in:', (time()-t0)

Fitting the kMeans model...
   done in: 0.822370052338

Fitting the LDA model...
   done in: 45.2499690056

Fitting the PLSA model...
   done in: 7.73369312286


## Evaluate the Results

In [82]:
from sklearn import metrics

#print(KM1.labels_[:100])
#print(KM2.labels_[:100])

print(LDA1.components_[:100])

#print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
#print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
#print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
#print("Adjusted Rand-Index: %.3f"
#      % metrics.adjusted_rand_score(labels, km.labels_))
#print("Silhouette Coefficient: %0.3f"
#      % metrics.silhouette_score(F2, km.labels_, sample_size=1000))

[[1.31412012e+00 1.35938391e+00 9.91809913e+03 8.88336655e+00
  2.74883958e+02 6.34923003e+01 2.91365698e+00 2.11908802e+01
  1.78439919e-01 1.70264819e+02]
 [1.33244633e-01 1.00031178e-01 4.02829341e-01 7.40479739e-01
  2.69868202e+01 1.15670987e-01 9.19756762e+03 1.61191738e+03
  1.02393087e-01 9.32339221e+02]
 [3.79345252e+03 1.00061444e-01 1.00344143e-01 1.00105207e-01
  1.39244693e-01 1.00041264e-01 1.00183233e-01 4.25975096e+02
  1.00034525e-01 1.87216583e+02]
 [1.12516250e-01 1.13113934e-01 1.07517205e+01 2.57183637e+00
  1.29481787e-01 1.00034953e-01 1.00889359e-01 1.52040167e+04
  1.00173577e-01 2.72747767e+02]
 [1.00000398e+02 4.68221938e+01 1.18427829e+03 5.68348780e+02
  6.54274445e+03 1.90331586e+03 1.06529451e-01 1.00022943e-01
  4.60929826e+03 3.15307834e+02]
 [1.01682287e-01 1.18290191e-01 3.37440868e+02 6.59715773e+01
  8.81324660e+03 7.54391430e-01 3.51796822e+00 1.89256978e+03
  1.00024051e-01 5.00263400e+02]
 [6.32005639e+03 3.47933151e+03 8.10780801e+01 1.05617202e

## View Cluster Top Terms

In [32]:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(4):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: god com keith atheism atheists people religion don sgi say
Cluster 1: game ca team hockey games nhl play espn university season
Cluster 2: com people cramer government optilink don state clinton gay just
Cluster 3: space nasa access henry com moon gov shuttle orbit digex


[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
