# Incorporating Persistence in ML Pipelines
#### Author - Siddharth Setlur
In this tutorial, we're going to incorporate persistence features into a simple, interpretable ML model. Although we might lose some accuracy compared to state-of-the-art (SOTA) deep learning architectures such as the CNNs and DNNs we've seen, we gain interpretability and run-time speed. We're going to work with a random forest classifier, an extension of the decision tree architecture we saw in the knot theory tutorial. Including the persistence computations, the entire process takes under a minute (at least on my laptop, which has only a CPU and 8GB of RAM).
Because using Gudhi with autodiff is tricky, we're going to use giotto-tda, which is well integrated with scikit-learn. We need to install giotto-tda and a few other packages, so
run

```conda env create -f requirements.yml```

and activate
```conda activate tda-env-giotto```.

If the conda solve takes too long or doesn't work, just use the tda-env we created earlier and pip install the required package whenever an import error is thrown. Just remember to pip install inside tda-env, i.e. run ```conda activate tda-env``` in the terminal before running pip install.

## 3D Shape classification

Topological features are most appropriate when the dataset that we're working with has a clear underlying shape that persistence can help detect. In this example, we're going to build a classifier for a synthetic dataset of 3D shapes. This notebook is based on the [giotto-tda tutorial](https://github.com/giotto-ai/giotto-tda/blob/master/examples/classifying_shapes.ipynb).

In [None]:
from helper_functions.generate_datasets import make_point_clouds
import numpy as np
#get the point clouds and their labels
point_clouds_basic, labels_basic = make_point_clouds(n_samples_per_shape=10, n_points=20, noise=0.5)


The first step is always to examine the dataset we have. Perhaps the first thing to do is to find the shape of the point clouds and the labels.

In [None]:
point_clouds_basic.shape, labels_basic.shape

There are 30 labels, one for each of the 30 point clouds. Each point cloud is a (400, 3) array. But how many different labels are there, i.e. how many different kinds of shapes are we working with?

In [None]:
np.unique(labels_basic)
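
As a small aside, `np.unique` can also report how many samples carry each label when called with `return_counts=True`. The sketch below uses a made-up label array (`toy_labels` is a stand-in, not the tutorial's `labels_basic`) purely to show the call.

```python
import numpy as np

# Stand-in label array: 3 classes with 10 samples each (an assumption for illustration).
toy_labels = np.repeat([0.0, 1.0, 2.0], 10)

# values holds the distinct labels, counts how often each one occurs.
values, counts = np.unique(toy_labels, return_counts=True)
for value, count in zip(values, counts):
    print(f"label {int(value)}: {count} samples")
```

Checking the counts is a quick way to confirm that the classes are balanced before training anything.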

Let's plot a sample of each class

In [None]:
samples = []
samples_labels = []
#get a single sample for each label
for i in range(len(labels_basic)):
    if labels_basic[i] not in samples_labels:
        samples.append(#TODO )
        samples_labels.append(#TODO)
#Plot the point clouds on a 3D projection
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Create a single figure with 3 subplots (one for each shape)
fig = plt.figure(figsize=(15, 5))
for i in range(#TODO):
    ax = fig.add_subplot(1, 3, i+1, projection='3d')
    ax.scatter(#TODO) #Hint - be careful  - you're plotting a 3D array!!
    ax.set_title(f'Shape {int(samples_labels[i])}')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')

plt.tight_layout()
plt.show()
#save the point clouds and their labels


What are the shapes? What do you expect their persistence diagrams to look like?

In [None]:
from gtda.homology import VietorisRipsPersistence
from gtda.plotting import plot_diagram
homology_dimensions = [0, 1, 2]
#Giotto has a very handy class for computing persistence diagrams. Given a point cloud, it builds the VR complexes and then computes the persistence diagrams in the specified dimensions.
VR_PD = VietorisRipsPersistence(homology_dimensions=homology_dimensions, collapse_edges=True) #here we initialize it to compute persistence in dimensions 0, 1 and 2
#compute the persistence diagrams for the sample point clouds
#fit_transform expects a batch of point clouds, so [None,:,:] adds a leading batch axis
pd1 = VR_PD.fit_transform(samples[0][None,:,:]) #circle
pd2 = #TODO #sphere
pd3 = #TODO #torus
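
To build some intuition for what happens in dimension 0, here is a minimal sketch (not giotto's implementation) of the finite H0 bars of a Vietoris-Rips filtration. Every point is born at 0, and a component dies when an edge first merges it into another one, so the death times are the edge lengths picked by Kruskal's minimum spanning tree algorithm. The names `h0_deaths` and `toy_points` are made up for this illustration, and the sketch assumes an edge enters the filtration at its full Euclidean length.

```python
import numpy as np

def h0_deaths(points):
    """Death times of the finite H0 bars (all births are 0)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # All edges (i < j), processed from shortest to longest.
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dist[i_idx, j_idx], kind="stable")
    parent = list(range(n))  # union-find forest

    def find(a):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    deaths = []
    for e in order:
        root_i, root_j = find(int(i_idx[e])), find(int(j_idx[e]))
        if root_i != root_j:
            # This edge merges two components, so one of them dies here.
            parent[root_i] = root_j
            deaths.append(float(dist[i_idx[e], j_idx[e]]))
    return np.array(deaths)

# Three collinear points with gaps of length 1 and 2.
toy_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
toy_deaths = h0_deaths(toy_points)
print(toy_deaths)
```

With n points there are n - 1 finite bars. The last surviving component never dies, and that infinite bar is left out here.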

The VR class also comes with a nice plot function that plots the persistence diagrams.

In [None]:
VR_PD.plot(pd1) #diagram for the circle

In [None]:
VR_PD.plot(#TODO) #diagram for the sphere

In [None]:
VR_PD.plot(#TODO) #diagram for the torus

As we saw in the lecture, we need to compute vectorizations of the persistence diagrams in order to feed them into ML pipelines. Again, Giotto makes our lives very easy by providing classes for a bunch of common representations. Here, we use the persistence landscape.
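
As a reminder of what a landscape is, the following sketch computes the first landscape layer of a tiny hand-written diagram using only NumPy. Each (birth, death) pair contributes a "tent" function, and the first layer is the pointwise maximum of the tents. This is a simplified illustration, and the names `first_landscape`, `toy_diagram` and `toy_grid` are made up here. Giotto's `PersistenceLandscape` may use different conventions (for example for sampling and scaling).

```python
import numpy as np

def first_landscape(diagram, grid):
    """Pointwise maximum of the tent functions of all (birth, death) pairs."""
    diagram = np.asarray(diagram, dtype=float)
    births = diagram[:, 0][:, None]
    deaths = diagram[:, 1][:, None]
    # Tent of a pair (b, d) at t: min(t - b, d - t), clipped below at 0.
    tents = np.minimum(grid[None, :] - births, deaths - grid[None, :])
    tents = np.clip(tents, 0.0, None)
    return tents.max(axis=0)

# Two bars: a short one (0, 2) and a longer one (1, 5).
toy_diagram = np.array([[0.0, 2.0], [1.0, 5.0]])
toy_grid = np.linspace(0.0, 5.0, 11)  # 0.0, 0.5, ..., 5.0
toy_landscape = first_landscape(toy_diagram, toy_grid)
print(toy_landscape)
```

Long bars give tall tents, so the landscape is dominated by the most persistent features of the diagram.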

In [None]:
from gtda.diagrams import PersistenceLandscape
landscape = PersistenceLandscape()

landscape_circ = landscape.fit_transform(pd1) #landscape for the circle
landscape_sph = landscape.fit_transform(#TODO) #landscape for the sphere
landscape_tor = #TODO #landscape for the torus


There are also super nice plotting functions for visualization! Can you see why a classifier fed the data of the persistence landscapes would be able to very easily classify the shapes?

In [None]:
landscape.plot(landscape_circ) #landscape for the circle

In [None]:
landscape.plot(#TODO) #landscape for the sphere

In [None]:
landscape.plot(#TODO) #landscape for the torus

We now train a classifier using just the landscapes. 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
#split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(point_clouds_basic, labels_basic, test_size=0.2, random_state=42)
#Compute the persistence landscapes for the training and test sets
H_train = landscape.fit_transform(VR_PD.fit_transform(X_train))
H_test = #TODO
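
With so few samples, a purely random split can leave a class under-represented in the test set. One option, which is not used in the cell above and is offered only as a suggestion, is the `stratify` argument of `train_test_split`, which preserves class proportions in both splits. The arrays `toy_X` and `toy_y` below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 30 samples, 3 balanced classes.
toy_X = np.arange(30).reshape(30, 1)
toy_y = np.repeat([0, 1, 2], 10)

# stratify=toy_y keeps the class proportions the same in train and test.
_, _, _, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
test_counts = np.bincount(toy_y_test, minlength=3)
print(test_counts)
```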

In [None]:
CLF = RandomForestClassifier(n_estimators=100, random_state=0, oob_score=True)
#The classifier expects one row of scalar features per point cloud, but each point cloud gives us 3 landscapes (one per homology dimension), each a vector of length 100. We sum each of these vectors to get 3 numbers per point cloud to feed into the classifier.
CLF.fit(H_train.sum(axis=2), y_train)
CLF.oob_score_

Just summing along the landscapes is very crude, but it works very well here (in fact, perfectly). We will soon see that this is not the case for real-world data, where we'll have to get creative. Let's delve more into the statistics.
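
To make the shapes in the summing step concrete, here is a small sketch with a random array standing in for the landscapes (the name `toy_landscapes` and the sample count of 24 are assumptions). Summing over the last axis collapses each length-100 vector to one number. Flattening is another way to get a 2D feature matrix, at the cost of many more features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for landscape output: 24 point clouds, 3 homology dimensions, 100 bins.
toy_landscapes = rng.random((24, 3, 100))

# Option used above: one scalar per homology dimension.
summed = toy_landscapes.sum(axis=2)

# Alternative: keep every bin as its own feature.
flattened = toy_landscapes.reshape(len(toy_landscapes), -1)

print(summed.shape, flattened.shape)
```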

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Import necessary libraries for evaluation metrics
import matplotlib.pyplot as plt

# Print out-of-bag score (accuracy)
print(f"Out-of-bag accuracy: {CLF.oob_score_:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(CLF.feature_importances_)), CLF.feature_importances_)
plt.xlabel('Feature Index')
plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

# Get predictions
y_pred = CLF.predict(H_test.sum(axis=2))

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Circle", "Sphere", "Torus"])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Blues, ax=ax) #draw on our axes; otherwise disp.plot creates its own figure
plt.title('Confusion Matrix')
plt.show()

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Circle", "Sphere", "Torus"]))

The performance is amazing! How can we interpret the classifier (look at the feature importance plot)?
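
When reading a feature importance plot, it helps to attach names to the feature indices. The sketch below uses synthetic data in which only the middle column carries class information, and it assumes the three summed-landscape columns follow the order of `homology_dimensions`, i.e. H0, H1, H2. The names `toy_X`, `toy_y` and `toy_clf` are made up, and the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy_y = np.repeat([0, 1, 2], 20)
# Three noisy columns standing in for the H0, H1, H2 features.
toy_X = rng.normal(size=(60, 3))
toy_X[:, 1] += 5.0 * toy_y  # only the "H1" column separates the classes

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Impurity-based importances are non-negative and sum to 1.
for name, importance in zip(["H0", "H1", "H2"], toy_clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

A column with a large importance is one the trees split on often and effectively, which is what makes the model easy to interpret.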

Let's try a more complicated dataset. We use a 3D dataset from a [Princeton computer vision course](https://www.cs.princeton.edu/courses/archive/fall09/cos429/assignment3.html) comprising 4 classes with 10 samples each, i.e. 40 point clouds in total.

In [None]:


from openml.datasets.functions import get_dataset
import pandas as pd
df = get_dataset('shapes').get_data(dataset_format='dataframe')[0]



Let's explore the dataframe

In [None]:
df.head()

Looks like each row contains a point and the label telling us which point cloud it belongs to. Let's see what the labels are

In [None]:
df['target'].unique()

Let's visualize each of the shapes

In [None]:
human_sample = df.query('target == "human_arms_out0"')[["x", "y", "z"]].values
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(human_sample[:, 0], human_sample[:, 1], human_sample[:, 2])
ax.set_title('Human Point Cloud')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

In [None]:
vase_sample = df.query('target == "vase0"')[["x", "y", "z"]].values
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(#TODO
ax.set_title('Vase Point Cloud')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

In [None]:
chair_sample = df.query('target == "dining_chair0"')[["x", "y", "z"]].values
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(#TODO)
ax.set_title('Chair Point Cloud')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

In [None]:
biplane_sample = df.query('target == "biplane0"')[["x", "y", "z"]].values
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(#TODO)
ax.set_title('Biplane Point Cloud')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

This is a weird way to label things: each point cloud has its own unique label, i.e. we have human_arms_out0,...,human_arms_out9 and similarly for the other 3 classes. We need to make a label array like the one in the toy example above, i.e. a 1-d numpy array of length 40 where each entry is 0, 1, 2, or 3 depending on which class the point cloud belongs to.

In [None]:
labels = np.zeros(40) # array with 40 zeros
labels[10:20] = 1 # label the samples 10-20 as 1 corresponding to the vase
labels[20:30] = 2 # label the samples 20-30 as 2 corresponding to the chair
labels[30:] = 3 # label the samples 30-40 as 3 corresponding to the biplane
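
Hard-coding the index ranges relies on the point clouds appearing in blocks of 10 in a known order. A possible alternative, sketched here on a short made-up list of names (`toy_targets`), is to strip the trailing digits from each name and let `np.unique` assign integer codes. Note that `np.unique` sorts the class names alphabetically, so the codes differ from the hand-assigned ones above.

```python
import re
import numpy as np

# Made-up sample of per-cloud names in the style of the dataset's target column.
toy_targets = ["human_arms_out0", "human_arms_out1", "vase0", "vase1",
               "dining_chair0", "biplane0"]

# Drop the trailing sample index to recover the class name.
class_names = [re.sub(r"\d+$", "", name) for name in toy_targets]

# classes: sorted distinct names; toy_labels: integer code of each entry.
classes, toy_labels = np.unique(class_names, return_inverse=True)
print(list(classes))
print(toy_labels)
```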

The weird labelling method does make it easier to extract a list of point clouds though! We can iterate over the unique labels of the df, since each df label corresponds to a unique point cloud 

In [None]:
point_clouds = np.asarray(
    [
        df.query("target == @shape")[["x", "y", "z"]].values
        for shape in df["target"].unique()
    ]
)

In [None]:
homology_dimensions = [0, 1, 2]
VR_PD = VietorisRipsPersistence(homology_dimensions=homology_dimensions, collapse_edges=True)
landscape = PersistenceLandscape()
CLF = RandomForestClassifier(n_estimators=100, random_state=0, oob_score=True)
#split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(point_clouds, labels, test_size=0.2, random_state=42)
#compute the persistence diagrams and landscapes for the training set
H_train = #TODO
CLF.fit(#TODO, y_train) #Remember to sum! 
CLF.oob_score_

In [None]:
H_test = landscape.transform(VR_PD.transform(X_test)) #reuse the transformers fitted on the training set so train and test landscapes share the same sampling
# Print out-of-bag score (accuracy)
print(f"Out-of-bag accuracy: {CLF.oob_score_:.4f}")

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(CLF.feature_importances_)), CLF.feature_importances_)
plt.xlabel('Feature Index')
plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

# Get predictions
y_pred = CLF.predict(H_test.sum(axis=2))

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Human", "Vase", "Chair", "Biplane"])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap=plt.cm.Blues, ax=ax) #draw on our axes; otherwise disp.plot creates its own figure
plt.title('Confusion Matrix')
plt.show()

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Human", "Vase", "Chair", "Biplane"]))

Our performance on such a small dataset using a simple classifier is already great, but let's add a few more features and see if we can improve our metrics. Look into changing or adding features using other vectorizations like PersistenceImage or Betti curves. See the [giotto documentation](https://giotto-ai.github.io/gtda-docs/latest/modules/diagrams.html#representations) for implementations of these features (hint: they work the same way as landscapes except for a change in name). You could also look into [features](https://giotto-ai.github.io/gtda-docs/latest/modules/diagrams.html#features) like the number of points in the diagram. Once you've decided on your feature, the pipeline is as follows:

1. Split the data: (point_clouds, labels) -> (x_train, y_train), (x_test, y_test).

2. Compute topological features on the training set, e.g. Landscape/Image(VR(x_train)) (or use features like the number of points).

3. If you have multiple vectors per point cloud, reduce them to scalars, e.g. by summing as we did for the landscape. Essentially, you can feed as many scalars as you want into the classifier, but not vectors.

4. Train the classifier using clf.fit(train_features, y_train).

5. Compute the topological features on the test set.

6. Predict using the classifier.

7. Display summary stats. Hint: you can basically copy the last cell displaying the statistics to do the last 2 steps, with minor modifications depending on your pipeline.

Finally, interpret the classifier and discuss why you think the feature you chose improved the performance.
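
As a pointer for one of the alternative vectorizations mentioned above, here is a minimal NumPy sketch of a Betti curve: for each value t on a grid, it counts how many bars of the diagram are alive at t. This is an illustration under simple conventions (a bar is alive for birth <= t < death), not giotto's implementation, and the names `betti_curve`, `betti_diagram` and `betti_grid` are made up.

```python
import numpy as np

def betti_curve(diagram, grid):
    """Number of bars alive at each grid value (birth <= t < death)."""
    diagram = np.asarray(diagram, dtype=float)
    births = diagram[:, 0][:, None]
    deaths = diagram[:, 1][:, None]
    alive = (births <= grid[None, :]) & (grid[None, :] < deaths)
    return alive.sum(axis=0)

# Three bars in one homology dimension.
betti_diagram = np.array([[0.0, 2.0], [1.0, 5.0], [1.5, 1.75]])
betti_grid = np.array([0.0, 1.0, 1.6, 3.0, 5.0])
toy_curve = betti_curve(betti_diagram, betti_grid)
print(toy_curve)
```

Unlike the landscape, the Betti curve weights short and long bars equally at any given t, so it can pick up on many small features that the landscape largely ignores.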