# Simplicial Complexes and Data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx

## Simplicial Homology

This week, we introduced homology of simplicial complexes. The *$k$th homology* is a machine which eats a simplicial complex $X$ and spits out a vector space $H_k(X)$ whose elements are equivalence classes of $k$-cycles (formal sums of $k$-simplices with no boundary) representing "$k$-dimensional holes" in the complex $X$.

This should lead you to a natural

**Question:** What does homology have to do with data analysis?

That is, most datasets that you deal with are not naturally structured as simplicial complexes. But this statement is not *quite* true: typical data is a finite set of points in $\mathbb{R}^n$ (or some other metric space), and a finite set of points can be considered as a (somewhat boring) simplicial complex with no simplices of dimension greater than zero. In particular, if $X$ is a set of $N$ points, then
$$
H_k(X) \cong \left\{\begin{array}{cc}
                \mathbb{F}_2^N & k = 0 \\
                0 & k \neq 0; \end{array}\right.
$$
i.e., homology just tells us the number of points in $X$ (not very interesting...).

The idea we will pursue in coming weeks is to use the vertex set $X$ to create increasingly complicated simplicial complexes which capture the geometry and topology of $X$ at many scales.

To motivate this, let's construct a toy dataset.

In [None]:
# Three Gaussian blobs with different centers, shapes and sizes
X1 = np.random.multivariate_normal([0,0],np.array([[1,0],[0,1]]),size = 10)
X2 = np.random.multivariate_normal([7,0],np.array([[1,0],[0,0.2]]),size = 8)
X3 = np.random.multivariate_normal([0,11],np.array([[1,0],[0,1]]),size = 12)

X = np.concatenate((X1,X2,X3))

plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
plt.show()

## Back to Clustering

Recall that one of our motivations for studying topology in the context of data analysis was hierarchical clustering.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage  

plt.figure(figsize=(10,5))
# Single linkage with Euclidean distances (this is also scipy's default)
linked = linkage(X, method='single')
dendrogram(linked)
plt.show()

This gives a multiscale picture of the clustering structure of the data. But "clusters" are just connected pieces of the data, and connected pieces correspond to basis elements of $H_0$. The idea is that hierarchical clustering is finding $H_0$ at various levels of 'resolution' in the data.
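
To make the link between the dendrogram and "connected pieces" concrete, here is a minimal sketch that cuts a single-linkage dendrogram at a chosen height and counts the resulting clusters. It uses `scipy.cluster.hierarchy.fcluster` with `criterion='distance'`. The toy points `pts`, the helper name `count_clusters`, and the cut heights are assumptions chosen for illustration; they are not part of the dataset above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight pairs of points, far from each other
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])

Z = linkage(pts, method='single')

def count_clusters(Z, height):
    """Number of flat clusters obtained by cutting the dendrogram at `height`."""
    labels = fcluster(Z, t=height, criterion='distance')
    return len(np.unique(labels))

# Below the first merge, between the merges, and above the last merge
for h in [0.5, 2.0, 9.5]:
    print(h, count_clusters(Z, h))
```

The cluster count drops as the cut height increases, which is the behavior we would like to describe in terms of $H_0$.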

Now let's try to make this precise.

## Creating a Growing Simplicial Complex

For each $r \geq 0$, define a simplicial complex $X_r = (V_r, \Sigma_r)$, where
$$
V_r = X \qquad \forall r
$$
and $\Sigma_r$ contains only $0$-simplices and $1$-simplices, where the $1$-simplices form the set
$$
\{\{x_i,x_j\} \mid x_i,x_j \in X, \; x_i \neq x_j, \; d(x_i,x_j) \leq r\},
$$
where $d(x_i,x_j)$ is Euclidean distance.

**Note:** $X_0 = X$.
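
For instance (a toy example, not the dataset above), take three points on a line, $x_1 = 0$, $x_2 = 1$, $x_3 = 3$, so that $d(x_1,x_2) = 1$, $d(x_2,x_3) = 2$ and $d(x_1,x_3) = 3$. Then the $1$-simplices of $X_r$ are
$$
\left\{\begin{array}{ll}
\emptyset & 0 \leq r < 1 \\
\{\{x_1,x_2\}\} & 1 \leq r < 2 \\
\{\{x_1,x_2\},\{x_2,x_3\}\} & 2 \leq r < 3 \\
\{\{x_1,x_2\},\{x_2,x_3\},\{x_1,x_3\}\} & r \geq 3,
\end{array}\right.
$$
so $X_r$ has three connected components for $r < 1$, two for $1 \leq r < 2$, and one for $r \geq 2$.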

Let's write a function to generate such a simplicial complex for each $r \geq 0$. 

First we create a distance matrix for $X$.

In [None]:
from sklearn.metrics import pairwise_distances

In [None]:
D = pairwise_distances(X)
plt.imshow(D)
plt.show()

In [None]:
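def create_simplicial_complex(D,r):
    """
    Input: distance matrix and nonnegative radius
    Output: networkx graph with one node per point and an edge
            {i,j} whenever i != j and D[i,j] <= r
    """
    
    G = nx.Graph()
    G.add_nodes_from(list(range(len(D))))
    # Use only the strict upper triangle: this skips the zero diagonal
    # (which would otherwise create self-loops) and lists each pair once.
    edge_list = np.argwhere(np.triu(D <= r, k=1)).tolist()
    G.add_edges_from(edge_list)
    
    return G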

Try it out:

In [None]:
r = 1
G = create_simplicial_complex(D,r)
nx.draw_kamada_kawai(G)

It looks like the function works, but we would like our drawings to be more related to the original dataset. We can set node positions in `networkx` graphs (there are lots of other options for visualization here).

In [None]:
pos = {n:X[n,:] for n in G.nodes()}
plt.figure(figsize = (5,5))
nx.draw_networkx(G, pos = pos, with_labels = False,node_size = 20)
plt.axis('equal')
plt.show()

Now let's look at how this evolves over a few radii:

In [None]:
rs = [0,0.5,1,1.5,3,4,5,8,10]

plt.figure(figsize = (20,20))

for (j,r) in enumerate(rs):
    G = create_simplicial_complex(D,r)
    plt.subplot(3,3,j+1)
    nx.draw_networkx(G, pos = pos, with_labels = False,node_size = 20)
    plt.axis('equal')
    plt.title('Radius = '+str(r))

plt.show()

Compare again to the dendrogram from above.

In [None]:
dendrogram(linked)
plt.show()

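Observe that the heights where clusters merge correspond exactly to the radii $r$ where the corresponding equivalence classes merge in $H_0(X_r)$. This is because `linkage` uses single linkage by default, which merges two clusters at the smallest distance between a point of one and a point of the other.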
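
This correspondence can be checked numerically. The sketch below compares, for a few radii, the number of connected components of the graph at radius $r$ with the number of flat clusters obtained by cutting a single-linkage dendrogram at height $r$. The random points `pts`, the fixed seed, the radii, and the helper names `num_components` and `num_clusters` are assumptions made for illustration. Distances are computed with `scipy.spatial.distance.pdist` so that both sides use identical values.

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)      # fixed seed, an arbitrary choice
pts = rng.normal(size=(30, 2))      # hypothetical stand-in for the dataset

condensed = pdist(pts)              # pairwise Euclidean distances
D_demo = squareform(condensed)      # the same distances as a square matrix
Z = linkage(condensed, method='single')

def num_components(D, r):
    """Connected components of the graph with an edge {i,j} iff D[i,j] <= r."""
    G = nx.Graph()
    G.add_nodes_from(range(len(D)))
    G.add_edges_from(np.argwhere(np.triu(D <= r, k=1)).tolist())
    return nx.number_connected_components(G)

def num_clusters(Z, r):
    """Flat clusters obtained by cutting the dendrogram at height r."""
    return len(np.unique(fcluster(Z, t=r, criterion='distance')))

# The two counts printed on each row should agree
for r in [0.1, 0.25, 0.5, 1.0, 2.0]:
    print(r, num_components(D_demo, r), num_clusters(Z, r))
```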

## Our Next Goals

- Formalize this idea of computing homology of varying simplicial complexes.
- In particular, look at higher-dimensional homology. To get something meaningful here, we'll need to throw in higher-dimensional simplices.
- Use this procedure as a way to compare datasets.
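
As a rough preview of the second goal, one possible way to add higher-dimensional simplices (an assumption here, not necessarily the construction we will adopt) is to fill in a triangle whenever all three of its edges are present. The sketch below lists such triangles using `networkx.enumerate_all_cliques`; the helper name `triangles_of` and the toy graph `G_demo` are hypothetical.

```python
import networkx as nx

def triangles_of(G):
    """Return the sorted vertex triples whose three edges are all in G."""
    tris = []
    # enumerate_all_cliques yields cliques in order of increasing size,
    # so we can stop as soon as we see a clique with more than 3 vertices.
    for clique in nx.enumerate_all_cliques(G):
        if len(clique) > 3:
            break
        if len(clique) == 3:
            tris.append(tuple(sorted(clique)))
    return sorted(tris)

# Toy graph: one candidate triangle {0,1,2} plus a dangling edge {2,3}
G_demo = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print(triangles_of(G_demo))
```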

## One More Example

To see how higher-dimensional homology would come into play, let's look at one more example.

In [None]:
# Sample 100 points, project them onto the unit circle, then add noise
X = np.random.multivariate_normal([0,0],np.array([[1,0],[0,1]]),size = 100)
X = X.T/np.linalg.norm(X,axis = 1)
X = X.T + 0.5*np.random.rand(100,2)

plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
plt.show()

In [None]:
D = pairwise_distances(X)

In [None]:
rs = [0,0.1,0.2,0.3,0.4,0.5,0.6,1,2]
# Build positions from the new dataset (G still refers to the old 30-point graph)
pos = {n:X[n,:] for n in range(len(X))}

plt.figure(figsize = (20,20))

for (j,r) in enumerate(rs):
    G = create_simplicial_complex(D,r)
    plt.subplot(3,3,j+1)
    nx.draw_networkx(G, pos = pos, with_labels = False,node_size = 50)
    plt.axis('equal')
    plt.title('Radius = '+str(r))

plt.show()
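
Because the complexes built so far contain only vertices and edges, their homology can be read off from simple graph counts: $\dim H_0$ is the number of connected components, and $\dim H_1 = |E| - |V| + \dim H_0$ is the number of independent cycles. The sketch below computes both for a `networkx` graph; the helper name `graph_betti_numbers` is hypothetical. Applied to graphs like those above, one would expect the $H_1$ count to keep growing with $r$ rather than isolate the single large loop, which is one motivation for adding higher-dimensional simplices.

```python
import networkx as nx

def graph_betti_numbers(G):
    """
    Return (dim H_0, dim H_1) for a graph viewed as a 1-dimensional
    simplicial complex. Assumes a simple graph with no self-loops.
    """
    b0 = nx.number_connected_components(G)
    b1 = G.number_of_edges() - G.number_of_nodes() + b0
    return b0, b1

# A 5-cycle has one component and one loop
print(graph_betti_numbers(nx.cycle_graph(5)))
# The complete graph on 4 vertices has one component and three independent loops
print(graph_betti_numbers(nx.complete_graph(4)))
```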