# Validation of kmeans clustering
Author: Tristan Miller

I'm loosely following the validation method from Ryan P. Adams's lecture notes, provided by Max.  The idea is to create test data with similar statistics to the real data, but with none of the clustering behavior.  Then kmeans is performed on both the test data and the real data, and a clustering statistic is computed for each.  The best choice of k is the one that maximizes the difference in the clustering statistic between the two.

From previous discussion, I thought the best way to generate test data would be to take real data, and shuffle the dimensions of each term vector.
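
As a rough illustration of the shuffling idea, here is a minimal sketch on a toy count matrix. It assumes one particular reading of "shuffle the dimensions": each term (column) is permuted independently across documents, which keeps every term's overall distribution but breaks the correlations between terms. The function name `shuffle_columns` and the toy matrix are hypothetical, not part of the notebook.

```python
import numpy as np

def shuffle_columns(counts, seed=0):
    """Return a copy of `counts` with each column permuted independently."""
    rng = np.random.default_rng(seed)
    shuffled = np.array(counts, copy=True)
    for j in range(shuffled.shape[1]):
        # Permute the per-document values of term j, independently of the other terms
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled

# Toy document-term counts (4 documents, 3 terms), made up for illustration
toy = np.array([[3, 0, 1],
                [0, 2, 0],
                [1, 1, 4],
                [0, 0, 2]])
test_counts = shuffle_columns(toy)
```

The sketch works on a dense array; a sparse bag of words would need something like `.toarray()` first, which should be affordable at 10000 x 209.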

In [11]:
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import scipy

In [2]:
data = pd.read_pickle("processed_10k_articles.pkl")

In [19]:
#first generate the bag of words.  This has no TF-IDF weighting yet.
#Only include words that occur in at least 5% of documents.
vectorizer = CountVectorizer(analyzer = "word",min_df=0.05)
clean_text = data["process"]
unweighted_words = vectorizer.fit_transform(clean_text)
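
To make the effect of `min_df` concrete, here is a small sketch on a made-up four-document corpus (the documents and the `toy_` names are hypothetical). With a float `min_df`, a term is kept only if it appears in at least that fraction of the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only
docs = ["french commune region",
        "french king",
        "football club league",
        "football club"]

# min_df=0.5 keeps terms found in at least half of the 4 documents
toy_vectorizer = CountVectorizer(analyzer="word", min_df=0.5)
toy_counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)        # ['club', 'football', 'french']
print(toy_counts.shape)  # (4, 3)
```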

In [20]:
print(unweighted_words.shape)
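#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())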

(10000, 209)
['age', 'also', 'america', 'american', 'anoth', 'april', 'area', 'around', 'august', 'award', 'back', 'base', 'becam', 'becom', 'began', 'best', 'book', 'born', 'british', 'call', 'career', 'caus', 'center', 'centuri', 'chang', 'children', 'citi', 'class', 'club', 'com', 'come', 'common', 'commun', 'could', 'counti', 'countri', 'creat', 'current', 'day', 'de', 'death', 'decemb', 'develop', 'die', 'differ', 'earli', 'east', 'end', 'england', 'english', 'even', 'event', 'exampl', 'famili', 'famou', 'februari', 'first', 'follow', 'footbal', 'form', 'former', 'found', 'four', 'franc', 'french', 'game', 'gener', 'german', 'get', 'given', 'go', 'good', 'govern', 'great', 'group', 'help', 'high', 'histori', 'home', 'hous', 'howev', 'http', 'ii', 'import', 'includ', 'intern', 'januari', 'japanes', 'john', 'juli', 'june', 'kill', 'king', 'known', 'la', 'larg', 'last', 'later', 'leagu', 'left', 'life', 'like', 'list', 'live', 'london', 'long', 'made', 'main', 'major', 'make', 'man',

For some reason there are only 209 features now.  Digits are no longer included, and some other words are also missing compared to the previous list.

I'm also worried by the appearance of www and wikit.  It seems that the regex isn't successfully filtering out everything yet.

In [21]:
#TF-IDF weighting can be applied after the fact with TfidfTransformer
Tfidf = TfidfTransformer()
#IDF weights only need to be calculated once, and can be reused for test data.
Tfidf.fit(unweighted_words)
#Now we apply the weights
real_data = Tfidf.transform(unweighted_words)
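
A small sketch of reusing the fitted IDF weights on test data, using toy matrices (all values and names here are made up). It assumes the test matrix is a column-wise rearrangement of the real counts. In that case each term's document frequency is unchanged, so fitting on either matrix gives the same IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy real counts (4 documents, 3 terms)
real_counts = np.array([[3, 0, 1],
                        [0, 2, 0],
                        [1, 1, 4],
                        [0, 0, 2]])
# Hypothetical test counts: same values in each column, rearranged across documents
test_counts = np.array([[0, 2, 4],
                        [1, 0, 2],
                        [3, 1, 0],
                        [0, 0, 1]])

tfidf = TfidfTransformer()
tfidf.fit(real_counts)                        # IDF weights learned from the real data only
real_weighted = tfidf.transform(real_counts)
test_weighted = tfidf.transform(test_counts)  # same IDF weights applied to the test data

# With the default settings each nonzero row is scaled to unit L2 norm
test_row_norms = np.linalg.norm(test_weighted.toarray(), axis=1)
```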

In [24]:
from sklearn.cluster import KMeans

In [26]:
#run kmeans on real data
kmeans=KMeans(n_clusters=15)
kmeans_pred=kmeans.fit_predict(real_data)

In [61]:
#print out some information about a cluster
def cluster_info(cluster_index,kmeans,kmeans_pred):
    #print out the top terms at the cluster center
    #the 'frequency' column holds the TF-IDF weight of each term at the cluster center
    cluster_mean = pd.DataFrame(index=range(real_data.shape[1]),columns=['term','frequency'])
    #get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    cluster_mean['term']=vectorizer.get_feature_names_out()
    cluster_mean['frequency']=kmeans.cluster_centers_[cluster_index]
    cluster_mean.sort_values('frequency',ascending=False,inplace=True)
    print(cluster_mean[0:10])

    #print out the titles of the first 21 articles assigned to this cluster
    print(data['title'][kmeans_pred == cluster_index].iloc[0:21].tolist())

cluster_info(9,kmeans,kmeans_pred)

       term  frequency
32   commun   0.486034
63    franc   0.422043
148  region   0.275081
61    found   0.220306
22   center   0.095247
39       de   0.086594
147   refer   0.058293
124   north   0.043674
168   south   0.041866
64   french   0.037705
['Henry II of France', 'Francis II of France', 'Region of Murcia', 'Medieval commune', 'Limoges', 'Champagne-Ardenne', 'Libourne', 'Luxembourg franc', 'French Community', 'Prime Minister of France', "Albon-d'Ardèche", 'Cellier-du-Luc', 'Charnas', 'Viviers, Ardèche', 'Rosières, Ardèche', 'Lake Annecy', 'Haute-Savoie', 'Province of Palermo', 'Province of Biella', 'Province of Genoa', 'Province of Bologna']


In [120]:
from scipy.spatial.distance import sqeuclidean

#Now let's get the clustering statistic:
#the sum over clusters of the mean squared distance from each point to its cluster center
def cluster_statistic(bag_of_words,kmeans,kmeans_pred):
    running_sum = [0.0]*len(kmeans.cluster_centers_)

    for i in range(len(kmeans_pred)):
        #for each data point, add the square distance to the cluster center to the running sum
        #ravel() flattens the (1, n_features) row to 1-D, which newer scipy versions require
        point = bag_of_words[i,:].toarray().ravel()
        running_sum[kmeans_pred[i]] += sqeuclidean(kmeans.cluster_centers_[kmeans_pred[i]],point)
    #normalize each cluster to the size of the cluster
    cluster_sizes = numpy.bincount(kmeans_pred,minlength=len(running_sum))
    for k in range(len(running_sum)):
        running_sum[k] /= cluster_sizes[k]
    return sum(running_sum)
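
The same statistic can also be computed without a Python loop. The sketch below is a NumPy-only version, assuming dense input; the name `cluster_statistic_vectorized` and the toy points are hypothetical. Note that this differs from the `inertia_` attribute of `KMeans`, which is the total sum of squared distances with no per-cluster normalization.

```python
import numpy as np

def cluster_statistic_vectorized(points, centers, labels):
    """Sum over clusters of the mean squared distance to the cluster center."""
    points = np.asarray(points, dtype=float)
    # Squared Euclidean distance from each point to its assigned center
    sq_dist = ((points - centers[labels]) ** 2).sum(axis=1)
    # Per-cluster totals and per-cluster sizes
    totals = np.bincount(labels, weights=sq_dist, minlength=len(centers))
    sizes = np.bincount(labels, minlength=len(centers))
    return float((totals / sizes).sum())

# Toy example: two clusters of two points each
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[1.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
toy_stat = cluster_statistic_vectorized(points, centers, labels)
print(toy_stat)  # 5.0 (mean of 1 for the first cluster plus mean of 4 for the second)
```

On the notebook's data this would presumably be called as `cluster_statistic_vectorized(real_data.toarray(), kmeans.cluster_centers_, kmeans_pred)`.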

In [121]:
stat = cluster_statistic(real_data,kmeans,kmeans_pred)

In [122]:
print(stat)

10.0165384481
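
Putting the pieces together, the validation idea from the introduction could be sketched as below. This is only an outline under several assumptions: the difference is taken as statistic(test) minus statistic(real), the test data is built by shuffling each column independently, and toy two-dimensional blobs stand in for the TF-IDF vectors. The function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def mean_sq_dist_statistic(points, centers, labels):
    # Sum over clusters of the mean squared distance to the cluster center
    sq_dist = ((points - centers[labels]) ** 2).sum(axis=1)
    totals = np.bincount(labels, weights=sq_dist, minlength=len(centers))
    sizes = np.bincount(labels, minlength=len(centers))
    return (totals / sizes).sum()

def statistic_difference(real, test, k_values, seed=0):
    # For each k, cluster both data sets and record statistic(test) - statistic(real)
    differences = {}
    for k in k_values:
        stats = []
        for points in (real, test):
            model = KMeans(n_clusters=k, n_init=10, random_state=seed)
            labels = model.fit_predict(points)
            stats.append(mean_sq_dist_statistic(points, model.cluster_centers_, labels))
        differences[k] = stats[1] - stats[0]
    return differences

# Toy data: two well-separated blobs stand in for the real data
rng = np.random.default_rng(0)
real = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])
# Test data: shuffle each column independently to break the cluster structure
test = np.column_stack([rng.permutation(real[:, j]) for j in range(real.shape[1])])

differences = statistic_difference(real, test, k_values=[2, 3, 4])
best_k = max(differences, key=differences.get)
```

Fixing `random_state` makes the comparison repeatable; without it, kmeans results vary from run to run.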
