# Cluster Analysis

This notebook analyses the word clusters for a model. This includes visualizing the fit of the clusterings, formatting the clusters for manual inspection, and visualizing them using Multidimensional Scaling (MDS).

We first import the libraries we will need throughout the project.

In [None]:
# Import graphing utilities
%matplotlib inline
import matplotlib.pyplot as plt

# Import useful mathematical libraries
import numpy as np
import pandas as pd

# Import useful Machine learning libraries
import gensim

# Import utility files
from utils import save_object, load_object, make_clustering_objects

### Set up directories

If this is the first time running this analysis, we first set up the directories the notebook needs for saving its output.

In [None]:
import os
directories = ['cluster-analysis']
for dirname in directories:
    # exist_ok avoids an error when the directory is already there
    os.makedirs(dirname, exist_ok=True)

### Set model name

Before beginning the rest of this project, we select a name for our model. This name is used to save and load the files for this model.

In [None]:
# Set the model we are going to be analyzing
model_name = "example_model"

### Measure fit

Now that we have initialized everything we need for our analysis, we can proceed to examine the fit of each clustering.

In [None]:
# Load the fit and test point values
fit = load_object('objects/', model_name + "-words" + "-fit")
test_points = load_object('objects/', model_name + "-words" + "-test_points")

In [None]:
# Plot the fit for each size
plt.plot(test_points, fit, 'ro')
plt.axis([0, 400, 0, np.ceil(fit[0] + (1/10)*fit[0])])
plt.show()
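
Reading the number of clusters off the plot is a judgment call. As a rough aid, the sketch below picks the first size after which the fit stops improving by much. It assumes that lower fit values are better (as with a k-means inertia) and that sizes are listed in increasing order. The helper `first_flat_point`, the `min_gain` threshold, and the example numbers are hypothetical and are not part of this project's `utils`.

```python
import numpy as np

def first_flat_point(sizes, fit_values, min_gain=0.05):
    """Return the first size whose next step improves the fit by less than min_gain.

    Assumes lower fit values are better and sizes are sorted in increasing order.
    """
    sizes = np.asarray(sizes)
    fit_values = np.asarray(fit_values, dtype=float)
    # Relative improvement obtained by moving from each size to the next one
    gains = (fit_values[:-1] - fit_values[1:]) / fit_values[:-1]
    below = np.where(gains < min_gain)[0]
    # If the curve never flattens, fall back to the largest size tested
    return int(sizes[below[0]]) if below.size else int(sizes[-1])

# Made-up values for illustration only; they are not results from any model
example_sizes = [25, 50, 100, 200, 400]
example_fit = [1000.0, 700.0, 560.0, 540.0, 530.0]
chosen = first_flat_point(example_sizes, example_fit)
print(chosen)
```

With the real data, the call would be `first_flat_point(test_points, fit)`; the result is only a starting point and should be checked against the plot.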

### Format for inspection

After measuring the fit of each clustering, we can decide on the number of clusters to use and focus on that clustering. To examine it more closely, we convert the clustering into a readable CSV file here.

In [None]:
# Set the number of clusters to analyze
num_clusters = 100

In [None]:
# Load the word2vec model, the fitted clustering, and the words-by-features matrix
model = gensim.models.Word2Vec.load('models/' + model_name + '.model')
kmeans = load_object('clusters/', model_name + "-words-cluster_model-" + str(num_clusters))
WordsByFeatures = load_object('matricies/', model_name + '-' + 'WordsByFeatures')

In [None]:
# model.wv.vocab exists in gensim < 4.0; in gensim >= 4.0 use model.wv.key_to_index instead
vocab_list = sorted(model.wv.vocab)

In [None]:
clusters = make_clustering_objects(model, kmeans, vocab_list, WordsByFeatures)
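
The cells that follow rely on each element of `clusters` being a dictionary with the keys `"word_list"`, `"total_freq"`, and `"unique_words"`, where each `"word_list"` entry is indexable (position 0 is used as the word and position 1 as the sort key). The implementation of `make_clustering_objects` is not shown in this notebook, so the sketch below is only an assumption about that structure, built from toy data. The function `toy_clustering_objects` is hypothetical, and treating position 1 as a word count is a guess.

```python
def toy_clustering_objects(words, counts, labels, n_clusters):
    """Group (word, count) pairs by cluster label.

    Hypothetical stand-in for illustration; not the project's make_clustering_objects.
    """
    clusters = [
        {"word_list": [], "total_freq": 0, "unique_words": 0}
        for _ in range(n_clusters)
    ]
    for word, count, label in zip(words, counts, labels):
        cluster = clusters[label]
        cluster["word_list"].append((word, count))  # assumed (word, count) layout
        cluster["total_freq"] += count              # sum of counts in the cluster
        cluster["unique_words"] += 1                # number of distinct words
    return clusters

# Toy input: three words, two clusters
toy = toy_clustering_objects(["cat", "dog", "car"], [5, 9, 2], [0, 0, 1], 2)
print(toy[0])
```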

In [None]:
# Sort each cluster's word list by the second element of each entry, largest first
for cluster in clusters:
    cluster["word_list"].sort(key=lambda x: x[1], reverse=True)

In [None]:
# Set the number of words to display. The table will contain the top size_words_list words
size_words_list = 100
table = []
for i, cluster in enumerate(clusters):
    row = ["cluster " + str(i + 1), cluster["total_freq"], cluster["unique_words"]]
    # Slicing stops at the end of shorter lists, so no exception handling is needed
    row.extend(cluster["word_list"][:size_words_list])
    table.append(row)

In [None]:
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('cluster-analysis/' + model_name + "-" + str(num_clusters) + '.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)
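
Two properties of the resulting file are worth keeping in mind when reading it back: rows can have different lengths (small clusters have fewer words), and non-string cells are written using their `str()` form. The sketch below illustrates both with toy rows written to a temporary file. The rows and the file name are made up, and the `(word, count)` tuple layout is an assumption.

```python
import ast
import csv
import os
import tempfile

# Toy rows shaped like the table above; the second row is shorter than the first
rows = [
    ["cluster 1", 14, 2, ("dog", 9), ("cat", 5)],
    ["cluster 2", 2, 1, ("car", 2)],
]

path = os.path.join(tempfile.mkdtemp(), "toy-clusters.csv")
with open(path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)

# csv.reader handles ragged rows and returns every cell as a string
with open(path, newline="") as csvfile:
    loaded = list(csv.reader(csvfile))

# A tuple cell comes back as text and can be parsed again with ast.literal_eval
first_word_entry = ast.literal_eval(loaded[0][3])
print(loaded[0][:4])
print(first_word_entry)
```

Because the rows are ragged, a reader that expects a fixed number of columns may need extra handling for this file.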

### Display clusters using MDS

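We now produce a visualization of our clusters in a low-dimensional space. MDS places the cluster centers in two dimensions, and each point is labelled with the first word in its cluster's sorted word list.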

In [None]:
# Fit MDS to the cluster centers
from sklearn.manifold import MDS
mds = MDS().fit(kmeans.cluster_centers_)
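
MDS in scikit-learn starts from a random initialization, so the layout can change between runs unless a seed is fixed. The sketch below, on random toy data standing in for `kmeans.cluster_centers_`, shows that passing `random_state` makes two fits agree. The array shape and the seed value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy stand-in for the cluster centers: 10 points in 50 dimensions
rng = np.random.default_rng(0)
toy_centers = rng.normal(size=(10, 50))

# Two independent fits with the same seed
first = MDS(n_components=2, random_state=0).fit(toy_centers).embedding_
second = MDS(n_components=2, random_state=0).fit(toy_centers).embedding_

print(first.shape)                  # one 2-D coordinate per input point
print(np.allclose(first, second))   # whether the two layouts coincide
```

Fixing the seed in the cell above, for example `MDS(random_state=0)`, would make the final plot repeatable across runs.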

In [None]:
# Get the embeddings and split them into x and y coordinates
embedding = mds.embedding_.tolist()
x = [point[0] for point in embedding]
y = [point[1] for point in embedding]

In [None]:
# Take the word from the first entry of each cluster's sorted word list
top_words = [cluster["word_list"][0][0] for cluster in clusters]

In [None]:
# Plot the cluster centers, labelling each with its top word
plt.figure(figsize = (20, 10))
plt.plot(x, y, 'bo')
for i in range(len(top_words)):
    plt.annotate(top_words[i], (x[i], y[i]))
plt.show()