The extra data file science2k-doc-word.npy contains a 1373x5476 matrix, where each row
is an article in Science described by 5476 word features. The articles and words are in the
same order as in the vocabulary and titles files above. 

In [1]:
import warnings; warnings.simplefilter('ignore')

import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

doc_word = np.load("science2k-doc-word.npy")
word_doc = np.load("science2k-word-doc.npy")

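# The text files have one entry per line and no header row;
# header=None keeps the first title and the first word as data.
titles = pd.read_table('science2k-titles.txt', header=None).values
vocab = pd.read_table('science2k-vocab.txt', header=None).values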

print(doc_word.shape)
print(word_doc.shape)

(1373, 5476)
(5476, 1373)


Clustering the documents using k-means and various values of k.

In [None]:
ks = range(1, 30)
inertia_doc_word = []
inertia_word_doc = []

# Fit k-means for each k and record the inertia
# (within-cluster sum of squared distances to the nearest center).
for n in ks:
    kmean = KMeans(n_clusters=n, random_state=0).fit(doc_word)
    inertia_doc_word.append(kmean.inertia_)
    kmean = KMeans(n_clusters=n, random_state=0).fit(word_doc)
    inertia_word_doc.append(kmean.inertia_)

fig, ax = plt.subplots()
ax.plot(ks, inertia_doc_word)
ax.plot(ks, inertia_word_doc)
ax.legend(['doc-word', 'word-doc'])
ax.set_xlabel("k (number of clusters)")
ax.set_ylabel("Inertia (sum of squared distances to nearest center)")
plt.show()

### k = 7 seems to be a reasonable choice. It is roughly the "elbow" of the curve: the value of k at which the inertia stops dropping steeply and begins to flatten out.
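The elbow reading described above is done by eye, but it can be made a little less subjective. The following is a minimal sketch, not the method used in this notebook: it picks the k where the discrete second difference of the inertia curve is largest. The function name `elbow_by_second_difference` and the inertia values are made up for illustration and are not results from the Science data.

```python
import numpy as np

def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply."""
    inertias = np.asarray(inertias, dtype=float)
    # Second difference I[k-1] - 2*I[k] + I[k+1]: large where a steep
    # drop is followed by a shallow one, i.e. at the elbow.
    curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # curvature[i] belongs to ks[i + 1] (the endpoints have no second difference)
    return ks[1 + int(np.argmax(curvature))]

# Hypothetical inertia values for k = 1..8 (illustrative only)
ks = list(range(1, 9))
toy_inertia = [100, 70, 40, 10, 8, 7, 6.5, 6]

elbow_k = elbow_by_second_difference(ks, toy_inertia)
print(elbow_k)
```

On real curves the second difference is noisy, so this kind of heuristic is best used as a sanity check alongside the plot rather than as a replacement for it.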

The top 10 words of each cluster, ranked by the largest positive difference between the cluster center and the mean document:

In [None]:
num_clusters = 7
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(doc_word)
clusters = kmeans.cluster_centers_   # shape (7, 5476): one center per cluster
center = np.mean(doc_word, axis=0)   # mean document, shape (5476,)
labels = kmeans.labels_              # cluster index of each document

for cluster in range(num_clusters):
    # Per-word difference between this cluster's center and the mean document
    diff = clusters[cluster] - center
    # Column indices of the 10 words with the largest positive difference
    top_10_words = np.argsort(-diff)[:10]
    words = [vocab[i, 0] for i in top_10_words]
    print('Cluster: %s, Top 10 Words: %s' % (cluster, words))

The top ten documents that fall closest to each cluster center:

In [None]:
for cluster in range(num_clusters):
    # Row indices (into doc_word and titles) of the documents in this cluster
    members = np.flatnonzero(labels == cluster)
    distance = euclidean_distances(doc_word[members], clusters[cluster].reshape(1, -1)).ravel()
    # Ascending sort puts the closest documents first; map back to original rows
    closest_10 = members[np.argsort(distance)[:10]]
    article_titles = [titles[i, 0] for i in closest_10]
    print('Cluster: %s, Top 10 Titles: %s' % (cluster, article_titles))

### The first listing reports, for each cluster, the ten words whose value at the cluster center exceeds the mean document's value by the largest amount. These are the words that are over-represented in the cluster relative to the average document, so they are useful for labeling the topic of each cluster.
### The second listing reports the ten documents closest to each cluster center. These are the documents most representative of each cluster. If the clusters capture document topics, this listing is helpful for checking which topic each cluster corresponds to and for grouping documents by topic.
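The two rankings discussed above can be checked by hand on a toy matrix. This is a sketch under stated assumptions: the 5x3 matrix, the three words, and the hand-assigned labels are invented for illustration (no k-means is run), and the helper names `top_features` and `closest_rows` are hypothetical.

```python
import numpy as np

# Toy data: 5 documents x 3 words, with hand-assigned cluster labels
X = np.array([[5.0, 0.0, 1.0],
              [4.0, 1.0, 1.0],
              [0.0, 6.0, 1.0],
              [1.0, 5.0, 1.0],
              [2.0, 4.0, 1.0]])
labels = np.array([0, 0, 1, 1, 1])
words = np.array(["gene", "star", "the"])

global_mean = X.mean(axis=0)

def top_features(cluster, n):
    """Words whose cluster-center value most exceeds the global mean."""
    center = X[labels == cluster].mean(axis=0)
    diff = center - global_mean
    return words[np.argsort(-diff)[:n]]

def closest_rows(cluster, n):
    """Original row indices of the n documents nearest the cluster center."""
    members = np.flatnonzero(labels == cluster)
    center = X[members].mean(axis=0)
    dist = np.linalg.norm(X[members] - center, axis=1)
    # argsort indexes into the subset, so map back through `members`
    return members[np.argsort(dist)[:n]]

print(top_features(0, 1), top_features(1, 1), closest_rows(1, 1))
```

Two details matter here: the first ranking sorts columns (words), while the second sorts rows (documents), and positions within a cluster's subset have to be mapped back to rows of the full matrix before they are used to look up titles.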

The file science2k-word-doc.npy is similar, but captures term-wise rather than document-wise
features. That is, each term is described by its frequency in each document, rather than each
document by the frequency of each term. This allows us to characterize individual terms.
This matrix is 5476x1373, where each row is a term in Science described by 1373 “document”
features. These are transformed document frequencies (as above). Below we repeat the analysis above,
but cluster terms instead of documents.

In [None]:
num_clusters = 7
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(word_doc)
clusters = kmeans.cluster_centers_   # shape (7, 1373): one center per cluster
center = np.mean(word_doc, axis=0)   # mean term, shape (1373,)
labels = kmeans.labels_              # cluster index of each term

for cluster in range(num_clusters):
    # Per-document difference between this cluster's center and the mean term
    diff = clusters[cluster] - center
    # The features are now documents, so the top indices point into titles
    top_10_docs = np.argsort(-diff)[:10]
    article_titles = [titles[i, 0] for i in top_10_docs]
    print('Cluster: %s, Top 10 Titles: %s' % (cluster, article_titles))

In [None]:
for cluster in range(num_clusters):
    # Row indices (into word_doc and vocab) of the terms in this cluster
    members = np.flatnonzero(labels == cluster)
    distance = euclidean_distances(word_doc[members], clusters[cluster].reshape(1, -1)).ravel()
    # Ascending sort puts the closest terms first; map back to original rows
    closest_10 = members[np.argsort(distance)[:10]]
    words = [vocab[i, 0] for i in closest_10]
    print('Cluster: %s, Closest 10 Words: %s' % (cluster, words))

### The first listing reports, for each cluster of terms, the ten documents whose value at the cluster center exceeds the mean term's value by the largest amount. These are the documents in which the terms of that cluster are most over-represented, so they indicate what kind of article the cluster's terms tend to appear in.
### The second listing reports the ten terms closest to each cluster center. These are the terms most representative of each cluster, which makes this listing useful for summarizing groups of terms that are used in similar sets of documents.
