# Intrinsic persistent homology

This notebook compares different distances that can be used when computing the persistent homology of a point cloud assumed to be sampled _close_ to a **manifold** in a high-dimensional Euclidean space.

#### Libraries

In [None]:
import numpy as np

#Plotting functions
from gtda.plotting import plot_point_cloud
from gtda.plotting import plot_heatmap

#Persistent homology
from gtda.homology import VietorisRipsPersistence
from gtda.plotting import plot_diagram

#Graph distances
from gtda.graphs import KNeighborsGraph, GraphGeodesicDistance
from sklearn.metrics.pairwise import euclidean_distances

#Manifolds
from sklearn.manifold import MDS

## Point cloud: the trefoil knot

In [None]:
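def generate_noise(type_noise, var, n):
    # `var` is a scale: the standard deviation for 'normal' noise,
    # the half-width of the interval for 'uniform' noise.
    if type_noise == 'normal':
        return np.random.normal(0, var, n)
    if type_noise == 'uniform':
        return np.random.uniform(-var, var, n)
    raise ValueError(f"Unknown type_noise: {type_noise!r}")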

def trefoil(n, type_noise, var):
    '''
    Noisy sample of the trefoil curve.
    
    Input:
    n: an integer, number of points in the sample
    type_noise: string, 'normal' or 'uniform'
    var: a float, scale of the noise (standard deviation if 'normal', half-width if 'uniform')
    Output:
    data: an n x 3 array, representing points in R^3
    '''
    
    phi = np.linspace(0,2*np.pi,n)
    
    X = np.sin(phi)+2*np.sin(2*phi)  + generate_noise(type_noise, var, n)
    Y = np.cos(phi)-2*np.cos(2*phi)  + generate_noise(type_noise, var, n)
    Z = - np.sin(3*phi)           + generate_noise(type_noise, var, n)
    
    data = np.column_stack((X,Y,Z))
    
    return data

We generate a noisy sample of the **trefoil knot**.

In [None]:
np.random.seed(1212)
point_cloud = trefoil(1500, 'normal', 0.18)
plot_point_cloud(point_cloud)

Notice that although the trefoil knot is a _non-trivial_ embedding of $S^1$ in $\mathbb{R}^3$, it is homeomorphic to a circle. In particular, it has the same homology groups as the circle, so its Betti numbers are $\beta_0 = 1$, $\beta_1 = 1$ and $\beta_i = 0$ for $i>1$.
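As a small sanity check of these Betti numbers, the sketch below computes $\beta_0$ and $\beta_1$ for a polygonal model of a circle (a cycle graph). Since homology is a homeomorphism invariant, the same values hold for the trefoil knot, however it is knotted in $\mathbb{R}^3$. The helper `betti_numbers_of_graph` is a hypothetical name introduced only for this illustration, and ranks are taken over the rationals.

```python
import numpy as np

def betti_numbers_of_graph(n_vertices, edges):
    """Betti numbers (b0, b1) of a graph viewed as a 1-dimensional complex.

    Hypothetical helper for illustration. With d1 the vertex-by-edge
    boundary matrix: b0 = V - rank(d1) and b1 = E - rank(d1).
    """
    boundary = np.zeros((n_vertices, len(edges)))
    for col, (u, v) in enumerate(edges):
        boundary[u, col] = -1.0  # the edge starts at u
        boundary[v, col] = 1.0   # the edge ends at v
    rank = np.linalg.matrix_rank(boundary)
    return int(n_vertices - rank), int(len(edges) - rank)

# A cycle graph with 12 vertices is a triangulation of the circle S^1.
n = 12
cycle_edges = [(i, (i + 1) % n) for i in range(n)]
b0, b1 = betti_numbers_of_graph(n, cycle_edges)
print(b0, b1)
```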

## Persistent Homology

Although different Riemannian metrics do not change the topology of a compact Riemannian manifold viewed as a metric space, the choice of the input distance when computing persistent homology from a sample plays a central role. In what follows, we will see how the choice of metric on the sample affects the information produced by the associated persistence diagram, using the trefoil knot as an example.

In [None]:
homology_dimensions = (0, 1)

* **Euclidean distance**

Given a sample $\mathbb{X}_n$ of points in Euclidean space, the Euclidean distance is the most widely used and most easily computed input distance for persistence diagrams. However, it does not capture the intrinsic information of the underlying topological space, so the output diagram depends heavily on the particular embedding of the data.
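To make the dependence on the embedding concrete, here is a minimal sketch on a noise-free trefoil. It compares the Euclidean distance with the intrinsic distance along the curve, approximated by the length of the shorter polygonal arc between two sample points. The sample size and the variable names are choices made for this illustration only.

```python
import numpy as np

# Noise-free trefoil, same parametrization as in trefoil() above.
n = 600
t = np.linspace(0, 2 * np.pi, n, endpoint=False)
curve = np.column_stack((np.sin(t) + 2 * np.sin(2 * t),
                         np.cos(t) - 2 * np.cos(2 * t),
                         -np.sin(3 * t)))

# Pairwise Euclidean (ambient) distances.
diff = curve[:, None, :] - curve[None, :, :]
euclid = np.sqrt((diff ** 2).sum(axis=-1))

# Intrinsic distances: shorter of the two arcs of the closed polygon.
seg = np.linalg.norm(np.roll(curve, -1, axis=0) - curve, axis=1)
s = np.concatenate(([0.0], np.cumsum(seg)[:-1]))  # arc-length position of each point
total = seg.sum()
arc = np.abs(s[:, None] - s[None, :])
intrinsic = np.minimum(arc, total - arc)

# Ratio Euclidean / intrinsic over all pairs of distinct points.
off_diag = ~np.eye(n, dtype=bool)
ratio = euclid[off_diag] / intrinsic[off_diag]
print("smallest Euclidean/intrinsic ratio:", ratio.min())
```

A ratio close to 1 means that the two distances agree. Small ratios occur near the places where two strands of the knot pass close to each other, which are the regions of small reach.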

In [None]:
E_matrix = euclidean_distances(point_cloud, point_cloud)
plot_heatmap(E_matrix, colorscale='blues')

In [None]:
VR_E = VietorisRipsPersistence(homology_dimensions=homology_dimensions, metric = 'euclidean')
diagram_E = VR_E.fit_transform(point_cloud[None, :, :])
VR_E.plot(diagram_E)

The persistence diagram shows many salient generators for the first homology group, as a consequence of areas of small reach in the embedding of $S^1$ in $\mathbb{R}^3$. To capture intrinsic information about the underlying manifold (less dependent on the particular embedding), it is preferable to endow the sample with an estimator of an intrinsic distance.

* **kNN distance**

Given the point cloud, compute kNN-distances for a given integer parameter $k>0$ as follows:

$$ d_{kNN}(x,y) = \displaystyle \inf_{\gamma}\sum_{i}|x_i-x_{i-1}|$$
    
where the infimum is over all finite paths $\gamma = (x_0 = x, x_1, \dots, x_{r-1}, x_r = y)$ between $x$ and $y$ in the kNN graph of the point cloud.
    
In the foundational [article](https://www.science.org/doi/10.1126/science.290.5500.2319) of the popular method ISOMAP, the authors show that if the sample lies on a Riemannian manifold $\mathcal M$ embedded in Euclidean space, then $d_{kNN}(x,y)$ converges to the geodesic distance $d_{\mathcal M}(x,y)$ as the sample size grows. However, this metric is highly sensitive to noise.
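The definition above can be sketched in plain NumPy on a toy example. This is an illustration under simplifying assumptions, not the giotto-tda implementation: the points are evenly spaced on the unit circle, the kNN graph is symmetrized, and shortest paths are found with a simple $O(n^3)$ Floyd-Warshall loop. The helper name `knn_graph_distance` is hypothetical.

```python
import numpy as np

def knn_graph_distance(points, k):
    """Shortest-path lengths in the symmetrized kNN graph (illustrative helper)."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep only the edges from each point to its k nearest neighbours.
    weights = np.full((n, n), np.inf)
    nearest = np.argsort(euclid, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    weights[rows, cols] = euclid[rows, cols]
    weights = np.minimum(weights, weights.T)          # make the graph undirected
    np.fill_diagonal(weights, 0.0)
    # Floyd-Warshall: allow paths through vertex m, for each m in turn.
    for m in range(n):
        weights = np.minimum(weights, weights[:, [m]] + weights[[m], :])
    return weights, euclid

n = 200
theta = 2 * np.pi * np.arange(n) / n
circle = np.column_stack((np.cos(theta), np.sin(theta)))
knn_dist, euclid = knn_graph_distance(circle, k=4)

# Antipodal points: geodesic distance on the unit circle is pi, Euclidean is 2.
print(knn_dist[0, n // 2], euclid[0, n // 2], np.pi)
```

On this noise-free sample the graph distance between antipodal points is close to the arc length $\pi$, whereas the Euclidean distance is the diameter $2$.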

In [None]:
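k = 5
# mode='distance' weights each edge by its Euclidean length, as in the definition of d_kNN
# (the default mode='connectivity' would count hops instead)
kNN = KNeighborsGraph(n_neighbors=k, mode='distance')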
X_kNN = kNN.fit_transform(point_cloud[None,:,:])
GGD = GraphGeodesicDistance()
kNN_matrix = GGD.fit_transform_plot(X_kNN)

In [None]:
VR_kNN = VietorisRipsPersistence(homology_dimensions=homology_dimensions, metric = 'precomputed')
diagram_kNN = VR_kNN.fit_transform(kNN_matrix)
VR_kNN.plot(diagram_kNN)

* **Fermat distance**

In the presence of noise, it may be useful to take into account the underlying density that produces the sample.

Given the point cloud $\mathbb{X}_n$, compute **Fermat distances** for a given parameter $p>1$ as follows:

$$ d_{\mathbb{X}_n,p}(x,y) = \displaystyle \inf_{\gamma}\sum_{i}|x_i-x_{i-1}|^p$$
    
where the infimum is over all finite paths $\gamma = (x_0 = x, x_1, \dots, x_{r-1}, x_r = y)$ between $x$ and $y$ in the complete graph on the point cloud.
    
In this [article](https://arxiv.org/abs/2012.07621), the authors prove that if the point cloud lies on a $d$-dimensional Riemannian manifold $\mathcal M$ embedded in a higher-dimensional Euclidean space and the sample is drawn according to a positive density $f:\mathcal M\to \mathbb{R}$, then $d_{\mathbb{X}_n,p}(x,y)$ converges (up to a rescaling factor) to the deformed geodesic distance

$$d_{\mathcal M, f, p}(x,y) = \inf_{\gamma} \int_{\gamma}\frac{1}{f^{(p-1)/d}}$$

where the infimum is over all smooth paths $\gamma : [0,1]\to \mathcal M$ between $x$ and $y$ on the manifold.

Notice that $d_{\mathcal M, f, p}$ penalizes areas of low density: where $f$ is small, the integrand $f^{-(p-1)/d}$ is large, so paths through sparse regions are long.

It can be shown that if the shortest paths in $\mathbb{X}_n$ are computed over the kNN graph with $k = O(\log(n))$, then the Fermat distance $d_{\mathbb{X}_n, p}$ still converges with high probability to the deformed geodesic distance $d_{\mathcal M, f, p}$.

Notice that the Fermat distance generalizes both the Euclidean and the kNN distance. Indeed, for $p=1$ and $k$ equal to the size of the sample (so that the kNN graph is the complete graph), we recover the ambient Euclidean distance by the triangle inequality. On the other hand, for $p=1$ and a smaller value of $k$ we recover the kNN-distance.
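The following minimal sketch illustrates the sample Fermat distance on the complete graph and the special case $p=1$. It is not the `fermat` package: `sample_fermat_distance` is a hypothetical helper that raises Euclidean distances to the power $p$ and runs an $O(n^3)$ Floyd-Warshall loop, which is only practical for small samples.

```python
import numpy as np

def sample_fermat_distance(points, p):
    """Fermat distance over the complete graph (illustrative helper)."""
    diff = points[:, None, :] - points[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=-1))
    cost = euclid ** p  # each edge costs |x_i - x_{i-1}|^p
    # Floyd-Warshall: allow paths through vertex m, for each m in turn.
    for m in range(len(points)):
        cost = np.minimum(cost, cost[:, [m]] + cost[[m], :])
    return cost, euclid

points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

d1, euclid = sample_fermat_distance(points, p=1)
d2, _ = sample_fermat_distance(points, p=2)

# p = 1: no detour beats the straight segment, so we get the Euclidean distance.
print(np.allclose(d1, euclid))
# p = 2: going 0 -> 1 -> 2 costs 1 + 1 = 2, cheaper than the direct edge 2**2 = 4.
print(d2[0, 2])
```

For $p>1$ the optimal path prefers many short hops over one long hop, which is why the Fermat distance favours paths through densely sampled regions.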

In [None]:
# TODO: modify this part according to the version of fermat in giotto-tda

from fermat import Fermat
from scipy.spatial import distance_matrix

def compute_fermat_distance(data, p, k):
    
    # Compute Euclidean distances
    distances = distance_matrix(data, data)
    
    # Initialize the model (path_method='D' uses Dijkstra on the kNN graph)
    fermat = Fermat(alpha=p, path_method='D', k=k)

    # Fit
    fermat.fit(distances)
    
    # Compute Fermat distances
    fermat_dist = fermat.get_distances()
    
    return fermat_dist

In [None]:
p = 7
Fermat_matrix = compute_fermat_distance(point_cloud, p, int(np.log(len(point_cloud))))
plot_heatmap(Fermat_matrix, colorscale='blues', title = 'Fermat distance for p = %s'%p)   

In [None]:
VR_Fermat = VietorisRipsPersistence(homology_dimensions=homology_dimensions, metric = 'precomputed')
diagram_Fermat = VR_Fermat.fit_transform(Fermat_matrix[None,:,:])
VR_Fermat.plot(diagram_Fermat)

#### The effect of deformation of Fermat distance with respect to $p$

The value of $p$ plays an important role in the computation of the Fermat distance, since it controls how strongly the density deforms the metric.
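A one-dimensional toy computation, given here only as a hedged illustration, shows how $p$ controls the deformation. For points on a line and $p \geq 1$, the optimal path visits consecutive points, because $(a+b)^p \geq a^p + b^p$ for $a, b \geq 0$. The Fermat distance between the two end points is therefore the sum of the gaps raised to the power $p$. The helper `fermat_length_on_line` and the two samples are assumptions made for this example.

```python
import numpy as np

def fermat_length_on_line(sorted_points, p):
    """Fermat distance between the end points of a sorted 1-D sample (p >= 1)."""
    gaps = np.diff(sorted_points)
    return np.sum(gaps ** p)

# Two intervals of the same Euclidean length, sampled at different densities.
dense = np.linspace(0.0, 1.0, 101)   # spacing 0.01
sparse = np.linspace(1.0, 2.0, 11)   # spacing 0.1

ratios = {}
for p in (1, 3, 5, 7):
    ratios[p] = fermat_length_on_line(sparse, p) / fermat_length_on_line(dense, p)
    print(p, ratios[p])
```

With spacings $h_1$ (dense) and $h_2$ (sparse), the ratio equals $(h_2/h_1)^{p-1}$, in agreement with the factor $f^{-(p-1)/d}$ for $d=1$: the larger $p$ is, the more the sparse interval is stretched relative to the dense one.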

In [None]:
def Riemmanian_deformation(point_cloud, p, k, n_components):
    embedding = MDS(n_components=n_components, dissimilarity='precomputed')
    Fermat_matrix = compute_fermat_distance(point_cloud, p, k)
    embedding_pc = embedding.fit_transform(Fermat_matrix)
    return embedding_pc

* **Fermat deformation for $p=3$**

In [None]:
p = 3
k = int(np.log(len(point_cloud)))
deformed_pc = Riemmanian_deformation(point_cloud, p, k, 3)
plot_point_cloud(deformed_pc)

* **Fermat deformation for $p=5$**

In [None]:
p = 5
k = int(np.log(len(point_cloud)))
deformed_pc = Riemmanian_deformation(point_cloud, p, k, 3)
plot_point_cloud(deformed_pc)

* **Fermat deformation for $p=7$**

In [None]:
p = 7
k = int(np.log(len(point_cloud)))
deformed_pc = Riemmanian_deformation(point_cloud, p, k, 3)
plot_point_cloud(deformed_pc)

For $p=1$ and $k>0$, we recover the intrinsic deformation carried by the kNN-distance, which is strongly affected by the presence of noise in areas of small reach.

In [None]:
p = 1
k = 5
deformed_pc = Riemmanian_deformation(point_cloud, p, k, 3)
plot_point_cloud(deformed_pc)

## Robustness to outliers

The presence of outliers is another factor that strongly impacts persistent homology computations. Whereas the accuracy of the kNN-distance as an approximation of the geodesic distance may be dramatically affected by outliers, the intrinsic information captured by persistence diagrams computed with the Fermat distance remains reliable even for samples with outliers. Indeed, it remains unaffected in positive homology degrees.

Let's add some outliers to our original sample of the trefoil knot.

In [None]:
n_out = 20
outliers = np.column_stack([(np.random.rand(n_out)-0.5)*5 for _ in range(3)])
point_cloud_outliers = np.concatenate((point_cloud, outliers))
plot_point_cloud(point_cloud_outliers)


In what follows, we will see how the addition of outliers affects the information given by the persistent homology computation, for different choices of intrinsic input distance.

* **kNN-distance**

In [None]:
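k = 5
# mode='distance' weights each edge by its Euclidean length, as in the definition of d_kNN
kNN = KNeighborsGraph(n_neighbors=k, mode='distance')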
X_kNN = kNN.fit_transform(point_cloud_outliers[None,:,:])
GGD = GraphGeodesicDistance()
kNN_matrix = GGD.fit_transform(X_kNN)

In [None]:
VR_kNN = VietorisRipsPersistence(homology_dimensions=[1], metric = 'precomputed')
diagram_kNN = VR_kNN.fit_transform(kNN_matrix)
VR_kNN.plot(diagram_kNN)

Compared with the case without outliers, we can see that outliers produce even more salient generators of $H_1$ when using the kNN-distance.

* **Fermat distance**

In [None]:
p = 7
Fermat_matrix = compute_fermat_distance(point_cloud_outliers, p, int(np.log(len(point_cloud_outliers))))

In [None]:
VR_Fermat = VietorisRipsPersistence(homology_dimensions=[1], metric = 'precomputed')
diagram_Fermat = VR_Fermat.fit_transform(Fermat_matrix[None,:,:])
VR_Fermat.plot(diagram_Fermat)

Compared with the case without outliers, the persistence diagram for $H_1$ remains essentially the same when using the Fermat distance as input. In contrast, adding outliers to the point cloud should sharply increase the number of salient generators of $H_0$, since every outlier is interpreted as a long-lasting connected component.
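The effect on $H_0$ can be checked on a toy example without any TDA library. The sketch below relies on the standard fact that the finite $H_0$ death values of a Vietoris-Rips filtration are the edge lengths of a minimum spanning tree, assuming the convention in which an edge enters the filtration at its length. It uses the Euclidean distance for simplicity, and the helper `h0_death_values` and the toy data are hypothetical.

```python
import numpy as np

def h0_death_values(points):
    """Finite H0 death values of the Vietoris-Rips filtration (illustrative).

    Runs Kruskal's algorithm: every minimum spanning tree edge merges two
    connected components, which kills exactly one H0 class.
    """
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dist[i_idx, j_idx])  # edges by increasing length
    parent = list(range(n))                 # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    deaths = []
    for e in order:
        ri, rj = find(i_idx[e]), find(j_idx[e])
        if ri != rj:                        # the edge joins two components
            parent[ri] = rj
            deaths.append(dist[i_idx[e], j_idx[e]])
    return np.array(deaths)

# Ten points on a segment, then the same points plus two far-away outliers.
cloud = np.column_stack((0.1 * np.arange(10), np.zeros(10)))
outliers = np.array([[0.45, 5.0], [0.45, -5.0]])
deaths_clean = h0_death_values(cloud)
deaths_noisy = h0_death_values(np.concatenate((cloud, outliers)))

print("long H0 bars without outliers:", int((deaths_clean > 1.0).sum()))
print("long H0 bars with outliers:   ", int((deaths_noisy > 1.0).sum()))
```

Each outlier contributes one additional long bar in degree 0, since it merges with the rest of the sample only at a large scale.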