# Cluster Analysis with K-means (Text Clustering)

<img style="float: left;" src="./images/k-means-clustering.png" width="500px">

## Environment Setup

#### First we will import some Python packages that we will use.

In [1]:
import nltk
import re
import pandas as pd
import pickle as pkl
from string import punctuation
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
%matplotlib inline 
import matplotlib.pyplot as plt


## More Environment Setup

#### We may also need to pull in some nltk resources. These resources will help us clean the data.

In [2]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Syntactic NLP Processing

#### We will define some Python functions that will perform some syntactic work on our corpus. 

In [3]:
# This function removes unnecessary punctuation that can add noise
def stripPunctuation(text):
    return ''.join(c for c in text if c not in punctuation)

# We will want to eventually tokenize a document into sentences and words
def tokenize(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out tokens not containing letters (e.g., numeric tokens, some punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

# We cache the stopwords up front for performance; a set gives constant-time membership tests
cachedStopWords = set(stopwords.words("english"))

# We will want to remove stopwords, which are typically high-frequency words that aid
# text fluency but do not provide much information gain.
def removeStopwords(text):
    no_stop = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return no_stop
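
#### To make the effect of these helpers concrete, here is a minimal sketch that applies the same two ideas to a single sentence. It assumes a tiny hand-written stopword set (`demo_stopwords`) in place of the NLTK list, and the function names `strip_punctuation_demo` and `remove_stopwords_demo` are illustrative only, not part of the notebook.

```python
from string import punctuation

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
demo_stopwords = {"the", "a", "on", "for", "and", "is"}

def strip_punctuation_demo(text):
    # drop every character that appears in string.punctuation
    return ''.join(c for c in text if c not in punctuation)

def remove_stopwords_demo(text):
    # keep only whitespace-separated words that are not stopwords
    return ' '.join(w for w in text.split() if w not in demo_stopwords)

sentence = "Please eval for loculated pleural effusion, on the right!"
no_punct = strip_punctuation_demo(sentence)
cleaned = remove_stopwords_demo(no_punct)
print(no_punct)
print(cleaned)
```

#### Like `removeStopwords` above, this sketch compares words exactly as written, so a capitalized stopword such as "The" would not be removed unless the text is lower-cased first.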

## Retrieving our Corpus

#### Let's pull in our corpus that we had serialized out to disk.  

In [4]:
#file = open('differential-corpus.pkl','rb')
# the context manager closes the file even if unpickling fails
with open('pna-corpus.pkl','rb') as file:
    corpus = pkl.load(file)

##  Preparing our Corpus for Processing

#### The first step is to clean up our corpus. We will strip punctuation, remove stopwords, and tokenize our text. Strictly speaking, stopword removal and tokenization can be done by a downstream function, as you will see. We do it here on the first document in our corpus just so you can get a good sense of how the corpus gets transformed.

In [6]:
cleanedCorpus = []
ranks = []
labels = []
idx = 1
for key in corpus:        
    cleanedText = stripPunctuation(corpus[key])
    if (idx < 2):
        print("**** ORIGINAL CLINICAL DOCUMENT ****")
        print()
        print(corpus[key])
        print("**** SOME PUNCTUATION REMOVED ****")
        print()        
        print(cleanedText)
        print("**** STOP WORDS REMOVED ****")
        print()
        textWithoutStopwords = removeStopwords(cleanedText)
        print(textWithoutStopwords)
        tokenizedText = tokenize(cleanedText)
        print()
        print("**** LOWER CASE AND TOKENIZE ****")
        print()
        print(tokenizedText)
        print()
    cleanedCorpus.append(cleanedText)
    labels.append(key)
    ranks.append(idx)   
    idx += 1

**** ORIGINAL CLINICAL DOCUMENT ****




     DATE: [**2996-12-2**] 10:25 AM
     CT CHEST W/O CONTRAST                                           Clip # [**Clip Number (Radiology) 11781**]
     Reason: please eval for loculated pleural effusion on the right and 
     Field of view: 30
     ______________________________________________________________________________
     UNDERLYING MEDICAL CONDITION:
      73 year old woman with h/o RUL subtotal resection for lung CA (3/'[**95**]) as well
      as a LLL superior segment resection  for Tb and now presents w/ a RLL PNA and
      parapneumonic effusion concerning for empyema
     REASON FOR THIS EXAMINATION:
      please eval for loculated pleural effusion on the right and place a pigtail
      catheter (send specimen for Gm stain C+S, AFB smear, cytology, Fungal culture)
      please do ASAP as pt may require to go to the OR later this afternoon
     ______________________________________________________________________________
        

##  Quantifying our Clinical Corpus for Cluster Analysis

#### We need to somehow quantify the importance of the words found in a document relative to one another, as well as relative to the entire corpus. Simply using raw frequency counts of words is a biased measure: longer documents will naturally have greater term frequencies than shorter documents. 

#### We can borrow from the field of information retrieval and use a measure known as tf-idf, short for term frequency-inverse document frequency. This statistic is intended to reflect how important a word is in a document relative to the other words found in the document, as well as relative to how often the word is used across the corpus. 
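
#### As a rough illustration of the idea, the sketch below computes the textbook form of tf-idf by hand on a made-up, already tokenized corpus. The helper names `tf`, `idf`, and `tfidf` are assumptions for this example. scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy corpus: three already-tokenized "documents" (made up for illustration).
docs = [
    ["chest", "pain", "chest"],
    ["chest", "effusion"],
    ["abdomen", "pain"],
]

def tf(term, doc):
    # term frequency: share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(N / number of documents containing `term`)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "chest" is frequent in document 0 but also common across the corpus,
# while "effusion" appears in only one document.
chest_score = tfidf("chest", docs[0], docs)
effusion_score = tfidf("effusion", docs[1], docs)
print(round(chest_score, 4), round(effusion_score, 4))
```

#### Even though "chest" occurs twice in its document, the rarer term "effusion" receives the higher score, which is the behavior tf-idf is designed to produce.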

#### TfidfVectorizer will not only produce our tf-idf word statistics, it will also first remove less important words known as stopwords and tokenize our corpus. The end result is a sparse matrix where each row represents a document in our corpus and each column is a distinct term (here a unigram or bigram) found across all of the documents. 

#### 1) You should be able to explain why we want this information represented as a sparse matrix. Why do you think that is? 


In [8]:
# Note that max_df is set such that we keep terms even if they appear across 100% of the corpus, and min_df is set
# such that we throw away terms that appear in less than 10% of our document corpus. We will also generate 
# unigrams and bigrams, which is about right for clinical notes. Beyond bigrams the matrix gets very sparse.
tfidfVectorizer = TfidfVectorizer(max_df=1.0, max_features=200000,
                                  min_df=0.1, stop_words='english',
                                  use_idf=True, tokenizer=tokenize, ngram_range=(1,2))   

tfidfMatrix = tfidfVectorizer.fit_transform(cleanedCorpus)

print("**** STORED AS A SPARSE MATRIX ****")
print(tfidfMatrix.shape)
print()

**** STORED AS A SPARSE MATRIX ****
(11108, 222)



##  Generated Data to Keep

#### We want to hold onto the terms that were selected by our tf-idf vectorizer. 

#### We will also perform some high-dimensional geometry by calculating the cosine of the angle between the multi-dimensional vectors that represent the text documents. Each element in a document vector corresponds to a distinct term selected by the vectorizer and holds that term's tf-idf value for the document. The cosine of the angle between two document vectors indicates how alike in topic the two documents may be. We will calculate the pairwise similarity between all document pairs and subtract it from 1 to obtain a distance. We will need these distance measures later when we try to visualize our clusters. 
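
#### The distance used here can be sketched in a few lines of plain Python. The vectors below are made up, and `cosine_similarity_demo` is a hypothetical helper written for this illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_similarity_demo(u, v):
    # cosine of the angle between u and v: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up tf-idf style vectors over the same three terms.
doc_a = [0.5, 0.5, 0.0]
doc_b = [1.0, 1.0, 0.0]   # same direction as doc_a, just longer
doc_c = [0.0, 0.0, 0.7]   # shares no terms with doc_a

dist_ab = 1 - cosine_similarity_demo(doc_a, doc_b)
dist_ac = 1 - cosine_similarity_demo(doc_a, doc_c)
print(dist_ab, dist_ac)
```

#### Documents that use the same terms in the same proportions end up at a distance close to 0 regardless of length, while documents with no terms in common end up at a distance of 1.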


In [9]:
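# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# (on scikit-learn versions before 1.0, call get_feature_names() instead)
distinctTerms = tfidfVectorizer.get_feature_names_out().tolist()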
print("**** SOME DISTINCT TERMS ***")
print(distinctTerms[-10:])
print(distinctTerms[:10])
print()
dist = 1 - cosine_similarity(tfidfMatrix)

**** SOME DISTINCT TERMS ***
['vein', 'venous', 'view', 'visualized', 'wall', 'wo', 'wo contrast', 'woman', 'year', 'year old']
['abdomen', 'abdominal', 'acute', 'admitting', 'admitting diagnosis', 'air', 'amt', 'amt final', 'amt underlying', 'aorta']



##  Time to Build Clusters

#### We will use K-means clustering to search for semantic patterns within our corpus.  K-means is a vector quantization method that attempts to partition $N$ observations (our clinical documents) into $K$ clusters. It assigns each observation to the cluster whose centroid is nearest, where the centroid is the mean of the observations in the cluster and represents its center. K-means tries to minimize the within-cluster variance, which is a measure of compactness. So the idea is to generate clusters that are as compact as possible given the set of observations to assign.  
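
#### The assign-then-update loop behind K-means can be sketched with NumPy as follows. This is a simplified illustration on made-up 2-dimensional points, not the scikit-learn implementation; the function name `kmeans_demo` and its parameters are assumptions for this example.

```python
import numpy as np

def kmeans_demo(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # start from k distinct observations chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each observation goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid moves to the mean of its observations
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two made-up, well-separated groups of points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans_demo(X, k=2)

# within-cluster sum of squared distances, the quantity K-means tries to minimize
wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(2))
print(labels, centroids, wcss)
```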

#### K-means is known as a hard-clustering method because each observation must be assigned to a single cluster. Other clustering methods, such as expectation-maximization, are known as soft-clustering methods because an observation can belong to multiple clusters, each with a certain probability. 

#### With K-means we need to determine the number of clusters we will produce a priori. There is no magic here; the best approach is some trial and error. 
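
#### One common way to structure that trial and error is to fit K-means for several values of $K$ and compare the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The sketch below does this on synthetic data with three obvious groups; the data and the range of $K$ values are assumptions chosen for illustration, not a recommendation for the clinical corpus.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 5, 10)])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_   # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

#### The inertia keeps shrinking as $K$ grows, so the value itself is not the target. What we look for is the point where adding another cluster stops producing a large drop, which on this synthetic data should be around $K = 3$.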


In [10]:
numClusters = 10
km = KMeans(n_clusters=numClusters)
km.fit(tfidfMatrix)
clusters = km.labels_.tolist()

## Some Basic Cluster Statistics

#### Let's look at the number of documents that were assigned to each cluster. We can also take a look at the mean document rank within each cluster, although it does not provide a great deal of information.
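
#### For readers less familiar with pandas, here is a small sketch of the two operations involved, `value_counts` and `groupby(...).mean()`, on a made-up table of six documents. The frame name `toy` and its values are assumptions for this example.

```python
import pandas as pd

# Made-up cluster assignments and ranks for six documents (illustration only).
toy = pd.DataFrame({
    "rank":    [1, 2, 3, 4, 5, 6],
    "cluster": [0, 1, 0, 0, 1, 2],
})

sizes = toy["cluster"].value_counts()               # documents per cluster
mean_rank = toy.groupby("cluster")["rank"].mean()   # average rank per cluster
print(sizes)
print(mean_rank)
```

#### Because the rank is just a document's position in the corpus, the mean rank only tells us roughly where in the corpus a cluster's documents sit.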

In [11]:
clinicalDocuments = { 'labels': labels, 'rank': ranks, 'corpus': cleanedCorpus, 'cluster': clusters }
frame = pd.DataFrame(clinicalDocuments, index = [clusters] , columns = ['rank', 'labels', 'corpus', 'cluster'])
print("**** CLUSTER COUNTS ****")
print(frame['cluster'].value_counts())
print()
grouped = frame['rank'].groupby(frame['cluster'])
print("**** CLUSTER MEANS ****")
print(grouped.mean())
print()


**** CLUSTER COUNTS ****
0    2274
4    1755
5    1535
9    1195
1    1062
3     965
7     880
6     709
2     375
8     358
Name: cluster, dtype: int64

**** CLUSTER MEANS ****
cluster
0    5319.274846
1    5860.382298
2    6577.245333
3    4828.119171
4    5870.433618
5    5545.520521
6    6042.555712
7    5581.265909
8    5667.441341
9    5200.344770
Name: rank, dtype: float64



## Better Insight into our Clusters

#### We will want to understand which documents clustered together and which important terms the documents in a cluster have in common. Check out the cluster output.
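
#### The top terms of a cluster are read off its centroid: the columns with the largest centroid weights are the terms that matter most for that cluster. The sketch below shows the `argsort` trick on a made-up centroid matrix; `terms`, `centers`, and `top_terms` are hypothetical names used only here.

```python
import numpy as np

# Made-up centroid matrix: 2 clusters x 4 terms (illustration only).
terms = ["chest", "abdomen", "effusion", "liver"]
centers = np.array([
    [0.60, 0.05, 0.30, 0.01],
    [0.02, 0.55, 0.03, 0.40],
])

# argsort sorts each row ascending, so reverse each row to get the largest weights first
order = centers.argsort()[:, ::-1]

# keep the two highest-weighted terms per cluster
top_terms = [[terms[j] for j in order[i, :2]] for i in range(len(centers))]
print(top_terms)
```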


In [None]:
print("**** TOP 10 TERMS PER CLUSTER ****")
orderedCentroids = km.cluster_centers_.argsort()[:, ::-1] # get indexes - sort smallest to largest then reverse order, largest to smallest
orderedCentroidScores = []
for i in range(numClusters):
    orderedCentroidScores.append(km.cluster_centers_[i, orderedCentroids[i,:]])  
for i in range(numClusters):
    print("Cluster# %d Words:" % i, end='')
    j = 0
    for indice in orderedCentroids[i, :10]:
        print(' %s: %.5f' % (distinctTerms[indice].split(' ')[0], orderedCentroidScores[i][j]), end=',')
        j += 1
    print()
    print()
    print("Cluster# %d Titles:" % i, end='')
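    # .ix was removed in pandas 1.0; select the rows of cluster i with a boolean mask instead
    for label in frame.loc[frame['cluster'] == i, 'labels'].tolist():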
        print(' %s,' % label, end='')
    print()
    print()


**** TOP 10 TERMS PER CLUSTER ****
Cluster# 0 Words: chest: 0.16245, ap: 0.11553, portable: 0.10713, pleural: 0.10618, reason: 0.09273, clip: 0.09106, chf: 0.08655, right: 0.08533, effusion: 0.08452, pulmonary: 0.08209,



Cluster# 0 Titles: PNA-POS11, PNA-POS24, PNA-POS29, PNA-POS34, PNA-POS48, PNA-POS57, PNA-POS65, PNA-POS66, PNA-POS72, PNA-POS73, PNA-POS81, PNA-POS82, PNA-POS94, PNA-POS95, PNA-POS101, PNA-POS102, PNA-POS103, PNA-POS104, PNA-POS106, PNA-POS110, PNA-POS111, PNA-POS116, PNA-POS120, PNA-POS121, PNA-POS128, PNA-POS132, PNA-POS133, PNA-POS134, PNA-POS135, PNA-POS138, PNA-POS140, PNA-POS141, PNA-POS142, PNA-POS143, PNA-POS144, PNA-POS154, PNA-POS155, PNA-POS157, PNA-POS159, PNA-POS166, PNA-POS167, PNA-POS169, PNA-POS170, PNA-POS171, PNA-POS173, PNA-POS174, PNA-POS175, PNA-POS179, PNA-POS180, PNA-POS181, PNA-POS182, PNA-POS184, PNA-POS185, PNA-POS186, PNA-POS187, PNA-POS188, PNA-POS202, PNA-POS207, PNA-POS208, PNA-POS219, PNA-POS220, PNA-POS224, PNA-POS230, PNA-POS231, PNA-POS235, PNA-POS236, PNA-POS238, PNA-POS239, PNA-POS241, PNA-POS243, PNA-POS244, PNA-POS254, PNA-POS260, PNA-POS265, PNA-POS267, PNA-POS270, PNA-POS273, PNA-POS274, PNA-POS275, PNA-POS289, PNA-POS292, PNA-POS295, PNA-POS296,

Cluster# 1 Titles: PNA-POS12, PNA-POS27, PNA-POS28, PNA-POS30, PNA-POS32, PNA-POS49, PNA-POS50, PNA-POS53, PNA-POS54, PNA-POS61, PNA-POS63, PNA-POS64, PNA-POS70, PNA-POS71, PNA-POS78, PNA-POS84, PNA-POS91, PNA-POS114, PNA-POS158, PNA-POS195, PNA-POS196, PNA-POS640, PNA-POS641, PNA-POS648, PNA-POS649, PNA-POS737, PNA-POS738, PNA-POS740, PNA-POS742, PNA-POS743, PNA-POS754, PNA-POS755, PNA-POS765, PNA-POS768, PNA-POS769, PNA-POS770, PNA-POS792, PNA-POS821, PNA-POS837, PNA-POS843, PNA-POS876, PNA-POS890, PNA-POS913, PNA-POS932, PNA-POS935, PNA-POS974, PNA-POS975, PNA-POS984, PNA-POS990, PNA-POS993, PNA-POS996, PNA-POS1007, PNA-POS1022, PNA-POS1033, PNA-POS1037, PNA-POS1039, PNA-POS1041, PNA-POS1065, PNA-POS1067, PNA-POS1071, PNA-NEG1, PNA-NEG2, PNA-NEG3, PNA-NEG8, PNA-NEG43, PNA-NEG96, PNA-NEG97, PNA-NEG98, PNA-NEG104, PNA-NEG105, PNA-NEG116, PNA-NEG117, PNA-NEG133, PNA-NEG280, PNA-NEG288, PNA-NEG289, PNA-NEG290, PNA-NEG292, PNA-NEG301, PNA-NEG302, PNA-NEG303, PNA-NEG304, PNA-NEG305, PNA-N

Cluster# 2 Titles: PNA-POS887, PNA-POS888, PNA-POS889, PNA-NEG127, PNA-NEG149, PNA-NEG150, PNA-NEG151, PNA-NEG152, PNA-NEG153, PNA-NEG154, PNA-NEG155, PNA-NEG156, PNA-NEG157, PNA-NEG158, PNA-NEG159, PNA-NEG160, PNA-NEG163, PNA-NEG164, PNA-NEG165, PNA-NEG166, PNA-NEG167, PNA-NEG168, PNA-NEG169, PNA-NEG170, PNA-NEG171, PNA-NEG172, PNA-NEG173, PNA-NEG174, PNA-NEG176, PNA-NEG177, PNA-NEG178, PNA-NEG179, PNA-NEG180, PNA-NEG181, PNA-NEG281, PNA-NEG642, PNA-NEG854, PNA-NEG855, PNA-NEG1058, PNA-NEG1060, PNA-NEG1061, PNA-NEG1062, PNA-NEG1063, PNA-NEG1064, PNA-NEG1065, PNA-NEG1066, PNA-NEG1067, PNA-NEG1068, PNA-NEG1069, PNA-NEG1182, PNA-NEG1183, PNA-NEG1184, PNA-NEG1188, PNA-NEG1189, PNA-NEG1190, PNA-NEG1191, PNA-NEG1201, PNA-NEG1202, PNA-NEG1203, PNA-NEG1204, PNA-NEG1205, PNA-NEG1206, PNA-NEG1207, PNA-NEG1208, PNA-NEG1209, PNA-NEG1210, PNA-NEG1503, PNA-NEG1541, PNA-NEG1542, PNA-NEG1543, PNA-NEG1544, PNA-NEG1545, PNA-NEG1546, PNA-NEG1547, PNA-NEG1548, PNA-NEG1549, PNA-NEG1584, PNA-NEG1585, PNA-N

Cluster# 4 Titles: PNA-POS340, PNA-POS424, PNA-POS480, PNA-POS492, PNA-POS618, PNA-POS619, PNA-POS620, PNA-POS656, PNA-POS660, PNA-POS717, PNA-POS733, PNA-POS739, PNA-POS772, PNA-POS795, PNA-POS814, PNA-POS842, PNA-POS865, PNA-POS870, PNA-POS879, PNA-POS880, PNA-POS882, PNA-POS883, PNA-POS884, PNA-POS898, PNA-POS912, PNA-POS931, PNA-POS970, PNA-POS987, PNA-POS994, PNA-POS998, PNA-POS999, PNA-POS1004, PNA-POS1005, PNA-POS1006, PNA-POS1019, PNA-POS1020, PNA-POS1028, PNA-POS1031, PNA-POS1032, PNA-POS1036, PNA-POS1042, PNA-POS1043, PNA-POS1044, PNA-POS1061, PNA-POS1063, PNA-POS1073, PNA-POS1083, PNA-POS1086, PNA-POS1093, PNA-POS1098, PNA-NEG14, PNA-NEG15, PNA-NEG31, PNA-NEG58, PNA-NEG66, PNA-NEG67, PNA-NEG68, PNA-NEG89, PNA-NEG93, PNA-NEG101, PNA-NEG102, PNA-NEG109, PNA-NEG111, PNA-NEG126, PNA-NEG130, PNA-NEG131, PNA-NEG132, PNA-NEG161, PNA-NEG162, PNA-NEG175, PNA-NEG237, PNA-NEG240, PNA-NEG243, PNA-NEG246, PNA-NEG250, PNA-NEG251, PNA-NEG252, PNA-NEG259, PNA-NEG260, PNA-NEG270, PNA-NEG272,

 PNA-NEG6689, PNA-NEG6690, PNA-NEG6691, PNA-NEG6695, PNA-NEG6696, PNA-NEG6697, PNA-NEG6698, PNA-NEG6699, PNA-NEG6701, PNA-NEG6702, PNA-NEG6703, PNA-NEG6705, PNA-NEG6732, PNA-NEG6744, PNA-NEG6754, PNA-NEG6755, PNA-NEG6756, PNA-NEG6757, PNA-NEG6759, PNA-NEG6761, PNA-NEG6765, PNA-NEG6774, PNA-NEG6777, PNA-NEG6778, PNA-NEG6787, PNA-NEG6797, PNA-NEG6798, PNA-NEG6803, PNA-NEG6813, PNA-NEG6855, PNA-NEG6856, PNA-NEG6857, PNA-NEG6860, PNA-NEG6864, PNA-NEG6865, PNA-NEG6869, PNA-NEG6902, PNA-NEG6906, PNA-NEG6916, PNA-NEG6918, PNA-NEG6919, PNA-NEG6920, PNA-NEG6921, PNA-NEG6973, PNA-NEG6975, PNA-NEG6976, PNA-NEG6991, PNA-NEG7006, PNA-NEG7009, PNA-NEG7010, PNA-NEG7039, PNA-NEG7040, PNA-NEG7053, PNA-NEG7054, PNA-NEG7055, PNA-NEG7056, PNA-NEG7057, PNA-NEG7085, PNA-NEG7086, PNA-NEG7087, PNA-NEG7093, PNA-NEG7094, PNA-NEG7095, PNA-NEG7099, PNA-NEG7112, PNA-NEG7114, PNA-NEG7115, PNA-NEG7128, PNA-NEG7130, PNA-NEG7136, PNA-NEG7140, PNA-NEG7144, PNA-NEG7146, PNA-NEG7147, PNA-NEG7148, PNA-NEG7165, PNA-NEG7167

Cluster# 5 Titles: PNA-POS127, PNA-POS136, PNA-POS160, PNA-POS190, PNA-POS192, PNA-POS198, PNA-POS199, PNA-POS203, PNA-POS204, PNA-POS223, PNA-POS226, PNA-POS237, PNA-POS242, PNA-POS255, PNA-POS266, PNA-POS290, PNA-POS303, PNA-POS311, PNA-POS313, PNA-POS320, PNA-POS325, PNA-POS330, PNA-POS351, PNA-POS360, PNA-POS376, PNA-POS386, PNA-POS387, PNA-POS390, PNA-POS391, PNA-POS392, PNA-POS398, PNA-POS433, PNA-POS434, PNA-POS446, PNA-POS497, PNA-POS503, PNA-POS506, PNA-POS520, PNA-POS546, PNA-POS547, PNA-POS552, PNA-POS572, PNA-POS576, PNA-POS584, PNA-POS586, PNA-POS592, PNA-POS605, PNA-POS606, PNA-POS617, PNA-POS626, PNA-POS627, PNA-POS634, PNA-POS647, PNA-POS654, PNA-POS669, PNA-POS685, PNA-POS686, PNA-POS690, PNA-POS692, PNA-POS693, PNA-POS697, PNA-POS698, PNA-POS699, PNA-POS709, PNA-POS710, PNA-POS711, PNA-POS712, PNA-POS713, PNA-POS714, PNA-POS715, PNA-POS718, PNA-POS720, PNA-POS721, PNA-POS726, PNA-POS727, PNA-POS731, PNA-POS741, PNA-POS745, PNA-POS746, PNA-POS747, PNA-POS749, PNA-POS76

Cluster# 7 Titles: PNA-POS97, PNA-POS98, PNA-POS214, PNA-POS215, PNA-POS225, PNA-POS232, PNA-POS233, PNA-POS310, PNA-POS368, PNA-POS377, PNA-POS426, PNA-POS445, PNA-POS535, PNA-POS542, PNA-POS551, PNA-POS554, PNA-POS555, PNA-POS558, PNA-POS559, PNA-POS577, PNA-POS614, PNA-POS629, PNA-POS631, PNA-POS642, PNA-POS653, PNA-POS672, PNA-POS676, PNA-POS677, PNA-POS687, PNA-POS732, PNA-POS748, PNA-POS758, PNA-POS762, PNA-POS763, PNA-POS764, PNA-POS783, PNA-POS787, PNA-POS788, PNA-POS789, PNA-POS794, PNA-POS796, PNA-POS797, PNA-POS801, PNA-POS802, PNA-POS803, PNA-POS804, PNA-POS824, PNA-POS825, PNA-POS826, PNA-POS827, PNA-POS838, PNA-POS845, PNA-POS869, PNA-POS873, PNA-POS907, PNA-POS924, PNA-POS925, PNA-POS927, PNA-POS928, PNA-POS929, PNA-POS930, PNA-POS944, PNA-POS945, PNA-POS949, PNA-POS953, PNA-POS973, PNA-POS985, PNA-POS1011, PNA-POS1012, PNA-POS1014, PNA-POS1015, PNA-POS1034, PNA-POS1035, PNA-POS1064, PNA-POS1081, PNA-POS1094, PNA-POS1101, PNA-POS1102, PNA-POS1104, PNA-POS1105, PNA-POS110

Cluster# 8 Titles: PNA-POS562, PNA-POS933, PNA-POS934, PNA-POS946, PNA-POS954, PNA-POS955, PNA-POS956, PNA-POS957, PNA-POS958, PNA-POS959, PNA-POS977, PNA-POS988, PNA-POS1000, PNA-POS1001, PNA-POS1002, PNA-POS1003, PNA-POS1013, PNA-POS1074, PNA-POS1075, PNA-POS1076, PNA-POS1077, PNA-POS1078, PNA-POS1079, PNA-POS1080, PNA-POS1082, PNA-POS1096, PNA-POS1099, PNA-NEG262, PNA-NEG266, PNA-NEG284, PNA-NEG387, PNA-NEG473, PNA-NEG474, PNA-NEG475, PNA-NEG476, PNA-NEG477, PNA-NEG707, PNA-NEG752, PNA-NEG756, PNA-NEG757, PNA-NEG809, PNA-NEG820, PNA-NEG821, PNA-NEG842, PNA-NEG846, PNA-NEG848, PNA-NEG911, PNA-NEG1008, PNA-NEG1009, PNA-NEG1147, PNA-NEG1148, PNA-NEG1149, PNA-NEG1150, PNA-NEG1151, PNA-NEG1152, PNA-NEG1179, PNA-NEG1180, PNA-NEG1185, PNA-NEG1195, PNA-NEG1224, PNA-NEG1225, PNA-NEG1226, PNA-NEG1227, PNA-NEG1228, PNA-NEG1248, PNA-NEG1249, PNA-NEG1250, PNA-NEG1251, PNA-NEG1252, PNA-NEG1253, PNA-NEG1254, PNA-NEG1255, PNA-NEG1256, PNA-NEG1273, PNA-NEG1291, PNA-NEG1292, PNA-NEG1293, PNA-NEG1294,

## Cluster Visualization

#### We have a bit of a problem in that we can only conceptualize what high-dimensional space must look like. We are only able to visualize 3-dimensional space, and most of us prefer 2-dimensional visualization. With text clustering we are dealing with hundreds of dimensions, one for each term in our vectors. The good news is that we know how to perform a reduction mapping, which projects high-dimensional space onto 2-dimensional space for visualization. We will use a process known as multidimensional scaling, which attempts to preserve the distances between objects when they are reduced to low-dimensional space. This may take a little time, so be patient; eventually you will see the clusters in 2-dimensional space.    
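
#### To get a feel for what multidimensional scaling does before running it on the full corpus, the sketch below reduces four made-up 3-dimensional points to 2 dimensions and compares the distances before and after. The points and the variable names (`points`, `toy_dist`, `toy_pos`, `embedded`) are assumptions for this example, and the `dissimilarity="precomputed"` argument mirrors the call used in this notebook; newer scikit-learn releases may emit deprecation warnings for it.

```python
import numpy as np
from sklearn.manifold import MDS

# Four made-up points in 3-dimensional space; the last one lies far from the rest.
points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [5.0, 5.0, 5.0],
])

# Pairwise Euclidean distance matrix (symmetric, zero diagonal).
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

toy_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
toy_pos = toy_mds.fit_transform(toy_dist)

# Distances in the 2-dimensional embedding, to compare against toy_dist.
embedded = np.linalg.norm(toy_pos[:, None, :] - toy_pos[None, :, :], axis=2)
print(toy_pos.shape)
print(np.round(toy_dist, 2))
print(np.round(embedded, 2))
```

#### The embedded distances will generally not match the originals exactly, but points that were far apart should stay far apart and close points should stay close.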

In [None]:
#plt.switch_backend('agg')
# Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases in a dataset. 
# MDS aims to place each object in a low-dimensional space such that the between-object distances are preserved
# as well as possible in the reduction mapping to two-dimensional space.
# "precomputed" because we provide a distance matrix
# "random_state" so the plot is reproducible
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)
xs, ys = pos[:, 0], pos[:, 1]
#set up colors for each clusters
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e', 
                  5: '#ffcc99', 6: '#33ffcc', 7: '#FFFF33', 8: '#ccccff', 9: '#99ccff'}
#set up cluster names for each color
cluster_names = {0: 'Cluster 0', 1: 'Cluster 1', 2: 'Cluster 2', 3: 'Cluster 3', 4: 'Cluster 4', 
                 5: 'Cluster 5', 6: 'Cluster 6', 7: 'Cluster 7', 8: 'Cluster 8', 9: 'Cluster 9'} 
#result of the MDS along with the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=labels)) 
#group by cluster
groups = df.groupby('label')
# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
ax.margins(0.05)
#iterate through groups to layer the plot
#note that the cluster_names and cluster_colors dicts are looked up by 'name' to return the appropriate label/color
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, label=cluster_names[name], color=cluster_colors[name], mec='none')
# axis styling only needs to be applied once, so it sits outside the loop
ax.set_aspect('auto')
# recent matplotlib versions expect booleans here rather than the string 'off'
ax.tick_params(
    axis='x',            # changes apply to the x-axis
    which='both',        # both major and minor ticks are affected
    bottom=False,        # ticks along the bottom edge are off
    top=False,           # ticks along the top edge are off
    labelbottom=False)   # labels along the bottom edge are off
ax.tick_params(
    axis='y',            # changes apply to the y-axis
    which='both',        # both major and minor ticks are affected
    left=False,          # ticks along the left edge are off
    right=False,         # ticks along the right edge are off
    labelleft=False)     # labels along the left edge are off
ax.legend(numpoints=1)  #show legend with only 1 point
#add label in x,y position with the label as the document name
for i in range(len(df)):
    # .ix was removed in pandas 1.0; df has the default integer index, so .loc works by row number
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=8)  
plt.show() #show the plot