# Visualizing High-Dimensional Datasets with TensorBoard's Embedding Projector

![](projector_screenshot.png)

### What's an embedding?
- "a mapping from discrete objects to vectors of real numbers."
- tries to capture the information of a system in a (usually) high-dimensional vector space
- often the input/output for machine learning models

**Example:** a phase-space embedding of particles in a simulation 
![](https://upload.wikimedia.org/wikipedia/commons/f/f7/Hamiltonian_flow_classical.gif)

**or:** a 300-dimensional embedding of English words
```
blue:  (0.01359, 0.00075997, 0.24608, ..., -0.2524, 1.0048, 0.06259)
blues:  (0.01396, 0.11887, -0.48963, ..., 0.033483, -0.10007, 0.1158)
orange:  (-0.24776, -0.12359, 0.20986, ..., 0.079717, 0.23865, -0.014213)
oranges:  (-0.35609, 0.21854, 0.080944, ..., -0.35413, 0.38511, -0.070976)
```
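One reason embeddings are useful is that geometric closeness can stand in for similarity. The sketch below illustrates this with cosine similarity on a tiny lookup table. The 3-dimensional vectors are invented for illustration (they are not the 300-dimensional values above), and `toy_embedding` and `cosine_similarity` are hypothetical names.

```python
import math

# hypothetical 3-dimensional vectors, invented for illustration only
toy_embedding = {
    "blue": [0.9, 0.1, 0.0],
    "blues": [0.8, 0.2, 0.1],
    "orange": [0.1, 0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_blue_blues = cosine_similarity(toy_embedding["blue"], toy_embedding["blues"])
sim_blue_orange = cosine_similarity(toy_embedding["blue"], toy_embedding["orange"])

# in this toy table, "blue" sits closer to "blues" than to "orange"
print(sim_blue_blues > sim_blue_orange)  # True
```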


### Google's Embedding Projector
[Embedding projector tutorial](https://www.tensorflow.org/guide/embedding)

**Some terminology:**
- TensorFlow is Google's machine learning framework
- TensorBoard is TensorFlow's visualization suite
- The embedding projector is a tool inside TensorBoard

[Original embedding projector paper](https://arxiv.org/pdf/1611.05469v1.pdf) 
- Authors find three common tasks:
![](embedding_projector_tasks.png)


Standalone projector: https://projector.tensorflow.org  
- [Wikipedia: Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
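
The standalone projector can also load your own data through its Load option, which takes a tab-separated file of vectors and, optionally, a tab-separated metadata file. The sketch below writes both with the standard library. The function name, file names, and toy values are made up for illustration. The layout (no header on the vectors file, a header on the metadata file only when it has more than one column) is how I recall the loader's instructions, so check it against the dialog's own help text.

```python
import csv
import tempfile
from pathlib import Path


def write_projector_tsv(vectors, metadata_rows, metadata_header, output_dir):
    """Write vectors.tsv and metadata.tsv (hypothetical names) for manual upload."""
    vectors = [list(row) for row in vectors]
    metadata_rows = [list(row) for row in metadata_rows]
    if len(vectors) != len(metadata_rows):
        raise ValueError("need exactly one metadata row per vector")

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # vectors: one point per line, tab-separated numbers, no header row
    with open(out / "vectors.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(vectors)

    # metadata: write a header row only when there are several columns
    with open(out / "metadata.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if len(metadata_header) > 1:
            writer.writerow(metadata_header)
        writer.writerows(metadata_rows)

    return out / "vectors.tsv", out / "metadata.tsv"


# toy, made-up points written to a temporary directory
vec_path, meta_path = write_projector_tsv(
    vectors=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadata_rows=[["first", "A"], ["second", "B"]],
    metadata_header=["label", "group"],
    output_dir=tempfile.mkdtemp(),
)
print(vec_path.read_text())
```

For a pandas DataFrame such as the Iris table, something like `df.values.tolist()` would produce the row lists this helper expects.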











### How can we load in our own data?

In [1]:
# import tensorflow and embedding projector
# (this is the TensorFlow 1.x API: tf.contrib and tf.Session are not available in TensorFlow 2.x)
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

# other stuff
import numpy as np
import pandas as pd
import pathlib # pathlib2 if in Python 2

# function to load data into tensorboard format
def to_tensorboard(name, vectors, metadata, output_dir='tensorboard'):
    # make sure output directory exists
    output_dir = pathlib.Path(output_dir)
    output_dir.mkdir(exist_ok=True)
        
    # write metadata as tsv
    metadata_path = output_dir / (name + '_metadata.tsv')
    metadata.to_csv(metadata_path, index=False, sep="\t")

    # store the vectors in a non-trainable TensorFlow variable and initialize it
    session = tf.Session()
    embedding = tf.Variable(np.array(vectors), trainable=False, name=name)
    tf.global_variables_initializer().run(session=session)

    # the saver will write the variable to a checkpoint; the writer targets the log directory
    saver = tf.train.Saver()
    writer = tf.summary.FileWriter(str(output_dir), session.graph)

    # tell the projector which tensor to display and where its metadata lives
    # (the metadata path is given relative to the log directory)
    config = projector.ProjectorConfig()
    embed = config.embeddings.add()
    embed.tensor_name = name
    embed.metadata_path = metadata_path.name
    
    projector.visualize_embeddings(writer, config)
    saver.save(session, str(output_dir / (name + '.ckpt')))
    
    print('Run `tensorboard --logdir={}` to run visualize result on tensorboard'.format(output_dir))


In [2]:
# load into tensorboard
import seaborn as sns
iris = sns.load_dataset('iris')
to_tensorboard(name='iris2', vectors=iris.select_dtypes('number'), metadata=iris)

Run `tensorboard --logdir=tensorboard` to run visualize result on tensorboard
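
`projector.visualize_embeddings` writes a small text file, `projector_config.pbtxt`, into the log directory. Assuming the vectors and metadata already exist as TSV files in that directory, the config can also be written by hand, without a TensorFlow session or checkpoint. This is only a sketch: `write_projector_config` is a hypothetical helper, the file names are placeholders, and the `tensor_path` field is taken from the projector's config format as I understand it, so verify it against your TensorBoard version.

```python
import tempfile
from pathlib import Path


def write_projector_config(log_dir, tensor_name, tensor_path, metadata_path):
    """Write projector_config.pbtxt pointing at TSV files inside log_dir."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    # text-format config: one `embeddings` entry per tensor to display;
    # both paths are given relative to the log directory
    config_text = (
        "embeddings {\n"
        f'  tensor_name: "{tensor_name}"\n'
        f'  tensor_path: "{tensor_path}"\n'
        f'  metadata_path: "{metadata_path}"\n'
        "}\n"
    )
    config_file = log_dir / "projector_config.pbtxt"
    config_file.write_text(config_text)
    return config_file


# placeholder file names; the TSV files themselves are assumed to exist already
config_file = write_projector_config(
    log_dir=tempfile.mkdtemp(),
    tensor_name="iris",
    tensor_path="iris_vectors.tsv",
    metadata_path="iris_metadata.tsv",
)
print(config_file.read_text())
```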


### A more complex/realistic example: word embeddings
- many data points
- high dimensional
- dimensions don't have easily interpretable meaning
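
Because the individual dimensions carry no obvious meaning, data like this is usually inspected after projecting it down to two or three dimensions. PCA is one of the projections the Embedding Projector offers. Below is a minimal NumPy sketch of that idea, using random stand-in vectors rather than a real word2vec model (`pca_project` and `fake_vectors` are hypothetical names).

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project rows of `vectors` onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


# random stand-in data: 100 points in 300 dimensions
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(100, 300))

projected = pca_project(fake_vectors)
print(projected.shape)  # (100, 2)
```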

In [3]:
import gensim.models.word2vec as word2vec
wv = word2vec.Word2Vec.load('word2vec_model').wv
print(wv.vectors.shape)
wv['project']

(12526, 300)


array([ 9.86255705e-01,  2.32414693e-01, -2.15211794e-01,  1.25850305e-01,
        7.92266369e-01,  1.07824523e-02, -4.35131043e-01,  2.22540751e-01,
        7.14744210e-01,  6.94814205e-01, -5.41788757e-01,  3.00752729e-01,
        1.96971729e-01, -4.86916482e-01,  5.82709074e-01,  1.49579003e-01,
       -2.48201296e-01,  4.76160944e-01, -1.73322096e-01, -1.43849447e-01,
        3.65187190e-02,  8.02628040e-01, -8.47673178e-01, -5.06042063e-01,
       -8.67615402e-01, -3.10089409e-01, -4.11390007e-01, -1.50633126e-01,
       -1.21625453e-01,  3.36989522e-01, -6.91168725e-01,  1.05551690e-01,
        4.30780739e-01, -3.02812189e-01,  1.25664473e+00, -2.01403484e-01,
        7.19923675e-01, -1.47228643e-01, -8.60940456e-01, -5.63046575e-01,
        1.59435615e-01, -2.49337614e-01,  4.52003777e-01, -5.11701584e-01,
       -1.44405380e-01, -6.91081583e-02, -3.43892783e-01, -4.78274345e-01,
       -7.82294199e-02, -1.01591146e+00, -4.02229279e-01,  1.50259173e+00,
        2.88873583e-01, -

In [4]:
# get how often each word appears
# (gensim 3.x API; gensim 4 renamed `index2word` to `index_to_key` and removed `vocab`)
counts = [wv.vocab[word].count for word in wv.index2word]

# perform clustering on the word embeddings
from sklearn import mixture
bgmm = mixture.BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=1e7)
bgmm.fit(wv.vectors)
cluster_labels = bgmm.predict(wv.vectors)

# package everything into a metadata DataFrame
metadata = pd.DataFrame(
    {'Word' : wv.index2word, 
    'Counts' : counts, 
    'Cluster' : cluster_labels})
metadata


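| | Word | Counts | Cluster |
|---|---|---|---|
| 0 | the | 291436 | 8 |
| 1 | of | 197853 | 8 |
| 2 | NUM | 136753 | 19 |
| 3 | and | 126016 | 8 |
| 4 | to | 114417 | 8 |
| 5 | in | 86112 | 8 |
| 6 | a | 67384 | 8 |
| 7 | PROPN | 45314 | 4 |
| 8 | that | 43573 | 8 |
| 9 | is | 43396 | 7 |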


In [5]:
to_tensorboard(name='word2vec', vectors=wv.vectors, metadata=metadata)


Run `tensorboard --logdir=tensorboard` to run visualize result on tensorboard
