# Introduction to Topological Data Analysis, Part II

In this notebook we will focus on methods for comparing topological signatures. We will try some more realistic applications to shape matching and image processing.

In [None]:
# Import standard packages for TDA and scientific computing
from ripser import ripser
from ripser import Rips
import persim
from persim import PersImage
from persim import plot_diagrams 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from sklearn import datasets
from math import ceil
import time

# Import packages for loading .mat files
import os 
from os.path import dirname, join as pjoin
import scipy.io as sio

## Example 1: Computing Persistence Diagrams, Review

Define a point cloud.

In [None]:
N = 1000
r = 5
R = 10

theta = 2*np.pi*np.random.rand(N)
phi = 2*np.pi*np.random.rand(N)
X = (R + r * np.cos(phi)) * np.cos(theta)
Y = (R + r * np.cos(phi)) * np.sin(theta) 
Z = r *  np.sin(phi)
pointCloud = np.append(X.reshape(N,1),Y.reshape(N,1),axis =1)
pointCloud = np.append(pointCloud,Z.reshape(N,1), axis = 1)

Plot the point cloud.

In [None]:
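fig = plt.figure(figsize = (6,6))
# Passing projection='3d' to fig.gca() was removed in newer matplotlib; add_subplot works across versions.
ax = fig.add_subplot(projection='3d')
ax.scatter(pointCloud[:,0],pointCloud[:,1],pointCloud[:,2], c='b', marker='o');
# ax.set_aspect('equal') is not supported on 3D axes in some matplotlib versions.
# Scaling the box to the data ranges gives equal axis scaling instead (needs matplotlib >= 3.3).
ax.set_box_aspect(np.ptp(pointCloud, axis=0));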

Using `ripser`, we compute the persistent homology of our point cloud and plot the resulting persistence diagrams. We only compute homology up to degree-1 for the sake of time.

In [None]:
dgms = ripser(pointCloud)['dgms']
fig = plt.figure(figsize=(6,6))
plot_diagrams(dgms, show=True)

Does the output make sense? How does it change if we play with the parameters in the creation of our point cloud?
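
One way to make "does the output make sense" quantitative is to count the points that sit far from the diagonal. For a torus we would hope to see one long-lived class in degree 0 and two in degree 1. The sketch below is only an illustration: `count_persistent` and the threshold are hypothetical choices of mine, and the toy diagram is made up rather than computed by `ripser`.

```python
import numpy as np

def count_persistent(dgm, threshold):
    """Count points (birth, death) whose lifetime death - birth exceeds threshold."""
    dgm = np.asarray(dgm, dtype=float).reshape(-1, 2)
    lifetimes = dgm[:, 1] - dgm[:, 0]
    return int(np.sum(lifetimes > threshold))

# Made-up degree-1 diagram: two long-lived loops plus three short-lived noise classes
toy_dgm = np.array([[1.0, 9.0], [1.2, 4.5], [0.9, 1.1], [1.0, 1.3], [2.0, 2.1]])
print(count_persistent(toy_dgm, threshold=2.0))   # 2
```

With real output you would call `count_persistent(dgms[1], threshold)`; a sensible threshold depends on `r`, `R`, and `N`.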

## Example 2: Bottleneck Distance and Shape Classification

### Loading and Exploring the Data

First we need to load the data set. The data consists of a large number of densely sampled plane curves representing various objects (bones, dogs, cars, etc.). The file is a .mat file, which we read into Python with the following commands.

In [None]:
data_dir = os.getcwd() # Get the current working directory name.
mat_fname = pjoin(data_dir, 'planarShapes.mat') 
# Add the file name to the current working directory.

mat_contents = sio.loadmat(mat_fname) # Read the file

Let's take a look at what is contained in the file.

In [None]:
mat_contents

Looks like we need to separate the actual data from the metadata. The types of data in the file are listed under several "keys". 

In [None]:
mat_contents.keys()

The plane curves we are after are under the 'planarShapes' key. Let's extract that from the mat file.

In [None]:
planarShapes = mat_contents['planarShapes']
planarShapes.shape

The second command above shows that planarShapes is a 2x100x1300 array. Exploring more, we would find that there are 1300 separate shapes, grouped into 65 classes of 20 similar shapes each. Each of the 1300 shapes is a point cloud in $\mathbb{R}^2$ consisting of 100 points. Let's plot a couple of the shapes below.

In [None]:
shape_indices = [4*x for x in range(25)]

fig = plt.figure(figsize = (15,15))

for j in range(25):
    ax = fig.add_subplot(5,5,j+1)
    shape = planarShapes[:,:,shape_indices[j]]
    ax.plot(shape[0,:], shape[1,:], linewidth=3)
    ax.axis('off')
    ax.axis('equal')
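
The indexing convention is worth spelling out, since it is used repeatedly below. This sketch uses a random stand-in array with the same 2x100x1300 layout (so it runs without the .mat file). It assumes that each class occupies 20 consecutive indices; `shape_class` is a hypothetical helper.

```python
import numpy as np

# Stand-in with the same layout as planarShapes: (coordinate, point, shape)
fake_shapes = np.random.rand(2, 100, 1300)

def shape_class(index, copies_per_class=20):
    """Class label of a shape, assuming classes occupy consecutive blocks of indices."""
    return index // copies_per_class

k = 127
cloud = fake_shapes[:, :, k].T   # transpose so that each row is one point
print(cloud.shape)               # (100, 2)
print(shape_class(k))            # 6
```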

The above code plots the shapes as continuous curves, but the data for each shape is really a point cloud.

In [None]:
plt.figure(figsize=(5,5))
shape = planarShapes[:,:,4]
plt.scatter(shape[0,:], shape[1,:])
plt.axis('off')
plt.axis('equal');

### Initial Experiments with Bottleneck Distance

Let's fix some shapes to use for examples.

In [None]:
shape_indices = [1,125,127] # Pick some shapes.

num_shapes = len(shape_indices) 
# If you want to pick different shape indices it will be useful to save this as variable.

fig = plt.figure(figsize = (10,5))

for j in range(num_shapes):
    ax = fig.add_subplot(1,3,j+1)
    shape = planarShapes[:,:,shape_indices[j]]
    ax.plot(shape[0,:], shape[1,:], linewidth=3)
    ax.axis('off')
    plt.title('Shape '+str(j))
    ax.axis('equal')

Let's compute persistence diagrams for these examples, then look at bottleneck distances between them. Note that ripser prefers the point clouds to be transposed. I.e., each shape is stored as a 2x100 array, but ripser wants to see a 100x2 array.

In [None]:
shapeDgms = [ripser(planarShapes[:,:,shape_indices[j]].T)['dgms'] for j in range(num_shapes)]

fig = plt.figure(figsize=(20,10))

for j in range(num_shapes):
    ax = fig.add_subplot(1,num_shapes,j+1) # You might need to change this layout if you change shape_indices
    plt.title('PD for Shape '+str(j))
    plot_diagrams(shapeDgms[j])

The "persim" package includes several distance metrics between persistence diagrams, including the bottleneck distance that we have defined in class. Let's compute bottleneck distances between our shape examples. There is an option to not only compute the distance, but to record the optimal matching which produces it. In the first example, we compute the bottleneck distance between the degree-1 persistence diagrams for shapes with indices 0 and 1.

In [None]:
distance_bottleneck, (matching, D) = persim.bottleneck(shapeDgms[0][1], shapeDgms[1][1], matching=True)
print(distance_bottleneck)

We can then plot the persistence diagrams on the same axes and display the optimal matching. The green line segment indicates matched points incurring the highest cost.

In [None]:
persim.plot.bottleneck_matching(shapeDgms[0][1], shapeDgms[1][1], matching, D, labels=['shape0', 'shape1'])
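
To see what `persim.bottleneck` is optimizing, here is a brute-force sketch of the definition for very small diagrams. Each point may be matched either to a point of the other diagram (at $\ell^\infty$ cost) or to the diagonal (at a cost of half its lifetime), and the bottleneck distance is the smallest achievable value of the largest cost. This is not how `persim` computes it, and `bottleneck_brute_force` is a hypothetical name; trying every permutation is only feasible for a handful of points.

```python
import itertools
import numpy as np

def bottleneck_brute_force(dgm1, dgm2):
    """Bottleneck distance by exhaustive search; only for tiny diagrams with finite points."""
    dgm1 = np.asarray(dgm1, dtype=float).reshape(-1, 2)
    dgm2 = np.asarray(dgm2, dtype=float).reshape(-1, 2)
    n, m = len(dgm1), len(dgm2)
    size = n + m
    if size == 0:
        return 0.0
    # Rows: the n points of dgm1, then m copies of the diagonal.
    # Columns: the m points of dgm2, then n copies of the diagonal.
    cost = np.zeros((size, size))  # diagonal-to-diagonal matches cost 0
    for i in range(n):
        for j in range(m):
            cost[i, j] = np.max(np.abs(dgm1[i] - dgm2[j]))  # l-infinity distance
        cost[i, m:] = (dgm1[i, 1] - dgm1[i, 0]) / 2          # send point i to the diagonal
    for j in range(m):
        cost[n:, j] = (dgm2[j, 1] - dgm2[j, 0]) / 2          # send point j to the diagonal
    best = np.inf
    for perm in itertools.permutations(range(size)):
        worst = max(cost[i, perm[i]] for i in range(size))
        best = min(best, worst)
    return float(best)

print(bottleneck_brute_force([[0, 4]], [[0, 5]]))    # 1.0: match the two points
print(bottleneck_brute_force([[0, 1]], [[10, 11]]))  # 0.5: cheaper to send both to the diagonal
```

The second call shows why short-lived points barely affect the distance: they can always be absorbed by the diagonal.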

Let's also compute the distance between shapes with indices 0 and 2.

In [None]:
distance_bottleneck, (matching, D) = persim.bottleneck(shapeDgms[0][1], shapeDgms[2][1], matching=True)
print(distance_bottleneck)

In [None]:
persim.plot.bottleneck_matching(shapeDgms[0][1], shapeDgms[2][1], matching, D, labels=['shape0', 'shape2'])

Finally, we compute the distance between shapes 1 and 2.

In [None]:
distance_bottleneck, (matching, D) = persim.bottleneck(shapeDgms[1][1], shapeDgms[2][1], matching=True)
print(distance_bottleneck)

persim.plot.bottleneck_matching(shapeDgms[1][1], shapeDgms[2][1], matching, D, labels=['shape1', 'shape2'])

So bottleneck distance seems to pick up on differences in the shapes. We can summarize the distances by computing a *distance matrix*. Since we are comparing 3 shapes, the distance matrix will be a $3 \times 3$ matrix whose $(i,j)$-entry is the bottleneck distance between Shape $i$ and Shape $j$.

In [None]:
# Compute the distance matrix
distMat = np.zeros((3,3))

for i in range(3):
    for j in range(3):
        distMat[i,j] = persim.bottleneck(shapeDgms[i][1], shapeDgms[j][1], matching=True)[0]
        
# Display the distance matrix
img = plt.imshow(distMat)
img.set_cmap('hot')
plt.colorbar()
plt.axis('off');

### Classification Experiment

Let's now try a more serious supervised learning experiment. We'll pick several shape classes, then several examples of shapes from each class. The goal is to see whether bottleneck distance between persistence diagrams will work as a classifier for the shapes.

In [None]:
shape_classes = [100,200,300,400,500,600,700,800] # Pick indices of shape classes to sample.
num_classes = len(shape_classes)
num_shapes = 16
# Pick number of examples to take from each shape class
# Pick carefully so that we get shapes from the same class!
# Remember the shapes come in groups of 20

# Create labels for the data
labels = []

for j in range(num_classes):
    labels = labels + num_shapes*[j]

# List all indices of the shape samples for the experiment.
samples = []
for j in range(num_classes):
    # In Python 3 a range cannot be concatenated to a list, so convert it first.
    samples = samples + list(range(shape_classes[j], shape_classes[j]+num_shapes))

# We now pick out the shapes with indices in 'samples' and preprocess.
num_samp = len(samples)

shapeSamples = [planarShapes[:,:,samples[j]].T for j in range(num_samp)]

Let's take a look at shapes from each of the shape classes.

In [None]:
fig = plt.figure(figsize=(15,5))

for j in range(num_classes):
    shape_example = shapeSamples[j*num_shapes]
    ax = fig.add_subplot(1,num_classes,j+1)
    ax.plot(shape_example[:,0], shape_example[:,1], linewidth = 3)
    plt.title('Shape Class '+str(j))
    ax.axis('off')
    ax.axis('equal')
    

Let's also look at the samples within a given class of shapes.

In [None]:
fig = plt.figure(figsize=(10,2))

for j in range(num_shapes):
    shape_example = shapeSamples[j]
    ax = fig.add_subplot(2,int(ceil(num_shapes/2)),j+1)
    ax.plot(shape_example[:,0], shape_example[:,1], linewidth = 3)
    ax.axis('off')
    ax.axis('equal')

Next we compute persistence diagrams for each of the shapes in our sampled collection. We are using the standard options for ripser, which will compute degree-0 and degree-1 persistence diagrams.

In [None]:
shapeSamplesDgms = [ripser(shapeSamples[j])['dgms'] for j in range(num_samp)]

We can now compute bottleneck distances between all pairs of persistence diagrams. We computed persistence diagrams for degree-0 and degree-1 persistent homology in the previous cell, so we could compute bottleneck distances in each degree.

It takes quite a while to run the degree-0 computation. It is commented out in the cell below because I don't want to run it in class. Feel free to uncomment and try it yourself.

In the cell below that, we compute the distance matrix for the degree-1 persistence diagrams. Note that, if there are N total shape samples, then this distance matrix should be an NxN symmetric matrix. The $(i,j)$-entry is the bottleneck distance between the persistence diagram of shape $i$ and shape $j$. We are really thinking of each shape as a point in a metric space!

In [None]:
# Uncomment this cell and run it if you want to!

#distMatDeg0 = np.zeros([num_samp,num_samp])

#for j in range(num_samp):
#    for k in range(j+1,num_samp):
#        distMatDeg0[j,k] = persim.bottleneck(shapeSamplesDgms[j][0], shapeSamplesDgms[k][0])
#    print(j)

#distMatDeg0 = distMatDeg0 + np.transpose(distMatDeg0)

In [None]:
distMatDeg1 = np.zeros([num_samp,num_samp])

for j in range(num_samp):
    for k in range(j+1,num_samp):
        distMatDeg1[j,k] = persim.bottleneck(shapeSamplesDgms[j][1], shapeSamplesDgms[k][1])

distMatDeg1 = distMatDeg1 + np.transpose(distMatDeg1)
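
The fill-the-upper-triangle-then-symmetrize pattern above works for any symmetric distance with zero self-distance. A minimal sketch, with a hypothetical helper `pairwise_distance_matrix` and a toy metric standing in for `persim.bottleneck`:

```python
import numpy as np

def pairwise_distance_matrix(items, dist):
    """Symmetric matrix of dist(items[j], items[k]), calling dist once per unordered pair."""
    n = len(items)
    D = np.zeros((n, n))
    for j in range(n):
        for k in range(j + 1, n):
            D[j, k] = dist(items[j], items[k])
    return D + D.T   # mirror the upper triangle; the diagonal stays 0

# Toy metric on numbers, standing in for a bottleneck distance between diagrams
toy_D = pairwise_distance_matrix([0.0, 1.0, 3.0], lambda a, b: abs(a - b))
print(toy_D)
```

This makes $n(n-1)/2$ calls instead of $n^2$, which matters when each call is as slow as a bottleneck computation.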

To understand the structure of our metric space, we can take a look at the distance matrix.

In [None]:
plt.figure(figsize=(5,5))
plt.imshow(distMatDeg1)
plt.colorbar();

Our goal is to test whether bottleneck distance works as a good classifier. To get a sense of what "good" means, we would like to compare to other standard classification techniques. Most standard algorithms take vectors as input, so let's reshape our data to put it in vector form.

In [None]:
X = [shapeSamples[j].reshape(200,) for j in range(num_samp)]
y = labels

One thing we can compare to is *Procrustes distance*. This takes a pair of shapes, translates, rescales, and rotates (or reflects) them so that they align as much as possible, then takes the sum of squared distances between aligned points. I want to use Procrustes distance as a callable function later, so I will define it now.

In [None]:
from scipy.spatial import procrustes

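def procDist(X,Y):
    # Undo the flattening: each shape is 100 points in the plane.
    X1 = X.reshape(100,2)
    Y1 = Y.reshape(100,2)
    # Compare the two arguments, not the globals shapeSamples[j] and shapeSamples[k].
    m1, m2, disp = procrustes(X1, Y1)
    return disp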
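
For intuition, the disparity can be sketched in a few lines of NumPy. This is my reading of the computation, not SciPy's actual code: center both shapes, rescale them to unit norm, then find the best orthogonal alignment from a singular value decomposition. `procrustes_disparity` is a hypothetical name.

```python
import numpy as np

def procrustes_disparity(A, B):
    """Sum of squared differences after optimally translating, scaling and rotating/reflecting B onto A."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A = A - A.mean(axis=0)          # remove translation
    B = B - B.mean(axis=0)
    A = A / np.linalg.norm(A)       # remove scale (unit Frobenius norm)
    B = B / np.linalg.norm(B)
    # The singular values of A^T B measure how well an orthogonal map can align B with A.
    s = np.linalg.svd(A.T @ B, compute_uv=False).sum()
    return 1.0 - s**2

rng = np.random.default_rng(0)
toy_shape = rng.random((100, 2))
angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
moved = 3.0 * toy_shape @ rot.T + np.array([5.0, -2.0])   # rotated, rescaled, translated copy
print(procrustes_disparity(toy_shape, moved))              # essentially 0
print(procrustes_disparity(toy_shape, rng.random((100, 2))))
```

The value always lies between 0 and 1, with 0 meaning the two shapes agree up to translation, scale, and rotation or reflection.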

We can then compute the Procrustes distance matrix for our shape data, as we did above for bottleneck distance.

In [None]:
distMatProcrustes = np.zeros([num_samp,num_samp])

for j in range(num_samp):
    for k in range(j+1,num_samp):
        distMatProcrustes[j,k] = procDist(X[j],X[k])

distMatProcrustes = distMatProcrustes + np.transpose(distMatProcrustes)

plt.figure(figsize=(5,5))
plt.imshow(distMatProcrustes)
plt.colorbar();

We will also use a callable function for bottleneck distance, so I will define one now.

In [None]:
def bottleneckDist(X,Y):
    X1 = X.reshape(100,2)
    Y1 = Y.reshape(100,2)
    dgm1 = ripser(X1)['dgms'][1]
    dgm2 = ripser(Y1)['dgms'][1]
    return persim.bottleneck(dgm1, dgm2)   

To test classification rate, we split our data into a training set and a testing set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)

Now we use $k$-Nearest Neighbors classification. We will test the classification performance of bottleneck distance and of Procrustes distance. Notice that the `scikit-learn` implementation of $k$-NN allows us to use callable functions for our distance metric. This is why I defined these earlier.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Define the bottleneck distance model and fit
neighBottleneck = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree', metric = bottleneckDist)
neighBottleneck.fit(X_train, y_train) 

# Define the Procrustes distance model and fit
neighProc = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree', metric = procDist)
neighProc.fit(X_train, y_train) 
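
Since $k$-NN only ever looks at distances, the vote itself is easy to sketch from a precomputed distance matrix such as `distMatDeg1`. The following is a minimal illustration and not `scikit-learn`'s implementation; `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_predict(dist_to_train, train_labels, k=3):
    """Majority vote among the k training points nearest to each query.

    dist_to_train[i, j] is the distance from query i to training point j.
    """
    predictions = []
    for row in np.asarray(dist_to_train, dtype=float):
        nearest = np.argsort(row, kind="stable")[:k]   # indices of the k smallest distances
        votes = Counter(train_labels[j] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy example: training points on a line, two classes
toy_train = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
toy_train_labels = [0, 0, 0, 1, 1, 1]
toy_queries = np.array([0.05, 4.9])
toy_dist = np.abs(toy_queries[:, None] - toy_train[None, :])
print(knn_predict(toy_dist, toy_train_labels, k=3))   # [0, 1]
```

`KNeighborsClassifier` also accepts `metric='precomputed'`, which avoids recomputing bottleneck distances at prediction time.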

Now we find the classification rate for each metric on the testing set. This takes a while because of the nature of $k$-NN, the fact that we are using callable functions as our metrics, and the slowness of the bottleneck distance computation.

In [None]:
neighBottleneck.score(X_test,y_test)

In [None]:
neighProc.score(X_test,y_test)

We see that bottleneck distance performs relatively well! How does it compare with Procrustes distance, which most would consider to be the "standard" metric to use here? Keep in mind that both scores depend on the train/test split, so try a few values of `random_state` before drawing conclusions.

We can use *multidimensional scaling (MDS)* to get a feel for why each metric does a good/bad job. MDS attempts to embed the metric space defined by each distance matrix into $\mathbb{R}^2$ or $\mathbb{R}^3$ (this is chosen by the user), in order to visualize clustering behavior.

In [None]:
# Import a package containing the MDS algorithm and set options for the algorithm
from sklearn import manifold
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)

# Compute MDS and extract the coordinates of the points
results = mds.fit(distMatDeg1)
coords = results.embedding_

plt.scatter(coords[:,0],coords[:,1], c=labels, cmap = 'hot')

In [None]:
results = mds.fit(distMatProcrustes)
coords = results.embedding_

plt.scatter(coords[:,0],coords[:,1], c=labels, cmap = 'hot')
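
`manifold.MDS` minimizes a stress function iteratively. A closely related closed-form variant, *classical* MDS, is short enough to sketch by hand: double-center the squared distances and keep the top eigenvectors. This is a different algorithm from the one `scikit-learn` runs, shown only for intuition; `classical_mds` is a hypothetical name.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed a distance matrix via the eigendecomposition of the double-centered Gram matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of the centered configuration
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues signal non-Euclidean distances; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale

# Sanity check: distances between planar points are reproduced
rng = np.random.default_rng(1)
toy_pts = rng.random((10, 2))
toy_D = np.linalg.norm(toy_pts[:, None] - toy_pts[None, :], axis=-1)
toy_emb = classical_mds(toy_D)
toy_D_emb = np.linalg.norm(toy_emb[:, None] - toy_emb[None, :], axis=-1)
print(np.allclose(toy_D, toy_D_emb))   # True
```

Bottleneck distance matrices are generally not Euclidean, so for real data the embedding only approximates the distances.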

To see what was predicted correctly/incorrectly, we can look at the *confusion matrix*. For a general multiclass classification problem with labels $0,1,\ldots,K$, the confusion matrix is the $(K+1) \times (K+1)$ matrix
$$
C = (C_{ij}) = \left(\begin{array}{cccc}
C_{00} & C_{01} & \cdots & C_{0K} \\
C_{10} & C_{11} & \cdots & C_{1K} \\
\vdots & \vdots & & \vdots \\
C_{K0} & C_{K1} & \cdots & C_{KK} \end{array}\right)
$$
with entry $C_{ij}$ giving the number of observations known to be in group $i$ and predicted to be in group $j$.

This can be computed via `scikit-learn` as follows.

In [None]:
from sklearn import metrics
predictedBottleneck = neighBottleneck.predict(X_test)
print(metrics.confusion_matrix(y_test, predictedBottleneck))
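
The definition above translates directly into a counting loop. A small sketch with made-up labels (`confusion_from_definition` is a hypothetical helper, not part of `scikit-learn`):

```python
import numpy as np

def confusion_from_definition(y_true, y_pred, num_labels):
    """C[i, j] = number of observations with true label i and predicted label j."""
    C = np.zeros((num_labels, num_labels), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_C = confusion_from_definition(toy_true, toy_pred, num_labels=3)
print(toy_C)
print(np.trace(toy_C) / toy_C.sum())   # fraction classified correctly
```

The diagonal counts correct predictions, so a good classifier produces a matrix that is close to diagonal.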

In [None]:
predictedProc = neighProc.predict(X_test)
print(metrics.confusion_matrix(y_test, predictedProc))

Apparently Procrustes distance was extremely confused by Shape 2!

We can also try to naively apply a Support Vector Machine to classify the shapes. Recall that this works extremely well on most toy data sets (e.g., `iris` or `digits`). Here it performs quite poorly.

In [None]:
from sklearn.svm import SVC
svm = SVC(gamma='auto')
svm.fit(X_train, y_train) 
svm.score(X_test,y_test)

In [None]:
predictedSVM = svm.predict(X_test)
print(metrics.confusion_matrix(y_test, predictedSVM))

Apparently SVM is extremely confused by Shape 4!

Of course, this was a naive attempt. There is certainly a smarter way to vectorize the shape data which will produce better classification results with SVM.

### Homework

Do more preprocessing of the data to improve SVM classification.

Ideas:
- Turn each curve into an 'image' i.e. a rectangular array of pixels with a black pixel value for points on the curve, white pixel value for points off the curve.
- Try smoothing this image, e.g. using the function from the previous notebook.
- Try 'filling in' the images, so that everything inside the shape's boundary gets a black pixel.
- All of these approaches require some 'registration' for the shapes. I.e., the shapes need to be aligned within each class, and maybe rescaled. This is becoming more complicated... Note that this step is not necessary for bottleneck distance, since persistent homology is not affected by rotations or translations.

Do any of these help SVM to beat bottleneck distance? What if you run bottleneck distance on the newly preprocessed images?

## Example 3: Vectorizing Persistence Diagrams and Image Processing

This is adapted from the demonstration available at https://github.com/scikit-tda/persim.

This notebook shows how you can use persistent homology and persistence images to classify datasets.  We construct datasets from two classes, one just noise and the other noise with a big circle in the middle. We then compute persistence diagrams with `ripser` and convert them to persistence images with `persim`. Using these persistence images, we build a Logistic Regression model to decide whether the dataset has a circle or not.

### Construct data

We will construct a data set consisting of several pointclouds which are just noise and several pointclouds which are noise on top of a noisy circle. The goal is to see whether tools from persistent homology can be used to classify the data into two categories. Searching for structure in a noisy signal is an important task in many imaging applications, e.g. https://en.wikipedia.org/wiki/Transmission_electron_cryomicroscopy.

We begin by defining a function which generates a noisy point cloud sampled from a sphere.

In [None]:
def sample_spherical(npoints, scale=1, ndim=2):
    vec = np.random.randn(ndim, npoints)
    vec /= np.linalg.norm(vec, axis=0)
    vec = scale*np.transpose(vec)
    return vec

def noisy_sphere(npoints, scale=1, noiseLevel=0.3, offset = 0, ndim=2):
    data = sample_spherical(npoints,scale, ndim)+scale*noiseLevel*np.random.random((npoints,ndim)) + offset*np.ones(ndim)
    return data

# npoints = number of points sampled
# scale = radius of the sphere being sampled
# noiseLevel = how "noisy" the samples are
# offset = shifts the circle in the (1,1,...,1) direction

Let's test it to make sure it produces the desired output.

In [None]:
data = noisy_sphere(200, scale=10, offset=50)

fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(data[:, 0], data[:, 1], 'ob');
ax1.axis('equal');

Next we define a function which produces a noisy point cloud in space, with a given scale, in a given dimension.

In [None]:
def noise(N, scale, ndim=2):
    # Uniform noise in the cube [0, scale)^ndim
    return scale * np.random.random((N, ndim))

Now we construct our data set. There are lots of parameters to change to produce different experiments. In general we generate samples of random noise and samples of random noise with noisy circles embedded in them. We can change numbers of samples, how noisy everything is, size and location of the circles, etc.

In [None]:
# We first define a bunch of parameters for the experiment
total_samples = 200 # Number of samples in our experiment
samples_per_class = int(total_samples / 2) 
# Number of samples of each type. We take an equal number of pure noise and noise+circle samples.
npoints = 400
# Number of points in each point cloud sample.
circle_noise_level = 0.5 # Noise level for the circles.
snr = 0.3
# "Signal-to-noise ratio". For the samples containing circles, this is the fraction of points that belong to the circle.
noise_scale = 150 # How large the noisy point cloud is
circle_scale = 30 # How large the noisy circle is. Should be chosen relative to the noise_scale.

# Generate the pure noise samples
just_noise = [noise(npoints, noise_scale) for _ in range(samples_per_class)]

# Generate the noise+circle samples
with_circle = [np.concatenate((noisy_sphere(int(snr*npoints), noiseLevel = circle_noise_level, scale=circle_scale, offset=(0.2*np.random.random()+0.4)*noise_scale), noise(npoints-int(snr*npoints), noise_scale)))
               for _ in range(samples_per_class)]

# Combine all the samples into one list.
datas = []
datas.extend(just_noise)
datas.extend(with_circle)

# Define labels. Pure noise samples are labeled with 0, noise+circles are labeled with 1.
# These labels will be used to train/test our classifier.
labels = np.zeros(total_samples)
labels[samples_per_class:] = 1

Let's visualize the data. We'll plot a random pure noise sample and a random circle+noise sample.

In [None]:
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(10,5)

random_sample_choice = np.random.randint(samples_per_class) # Pick a random index from 0 to samples_per_class - 1.

xs, ys = just_noise[random_sample_choice][:,0], just_noise[random_sample_choice][:,1]
axs[0].scatter(xs, ys)
axs[0].set_title("Example noise dataset")
axs[0].set_aspect('equal', 'box')

xs_, ys_ = with_circle[random_sample_choice][:,0], with_circle[random_sample_choice][:,1]
axs[1].scatter(xs_, ys_)
axs[1].set_title("Example noise with circle dataset")
axs[1].set_aspect('equal', 'box')

fig.tight_layout()

### Compute homology of each dataset

Generate the persistence diagram of $H_1$ for each of the datasets generated above.

In [None]:
start0 = time.time()
rips = Rips(maxdim=1, coeff=2) # Apply ripser with Z_2 coefficients, up to dimension 1.
diagrams = [rips.fit_transform(data) for data in datas]
# Reuse the diagrams computed above instead of running ripser a second time.
diagrams_h1 = [dgm[1] for dgm in diagrams]
end0 = time.time()

print('Computation Time: ' + str(end0 - start0) + ' seconds')

Let's plot persistence diagrams for a random choice of noise sample and a random choice of circle+noise. We'll use the "lifetime" plot style option because it will match with something we'll do with the diagrams later.

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(121)

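random_sample = np.random.randint(samples_per_class)

rips.plot(diagrams_h1[random_sample], show=False, legend=False, lifetime=True)
plt.title("PD of $H_1$ for just noise")

plt.subplot(122)
# Circle samples occupy the second half of the list. Indexing with -random_sample
# would pick a noise sample whenever random_sample is 0.
rips.plot(diagrams_h1[samples_per_class + random_sample], show=False, legend=False, lifetime=True)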
plt.title("PD of $H_1$ for circle w/ noise")

plt.show()

### Compute persistence images

Our goal is to do statistics on these persistence diagrams. Classical statistics is performed on vector spaces, but these PDs are not vectors! A popular approach to this problem is to find a way to construct a vector for each PD (so that the vectors live in the same vector space). The particular approach we will follow is to construct *persistence images*, first introduced in this paper: https://arxiv.org/abs/1507.06217

We will discuss the definition of a persistence image in class. The 'persim' package is made to handle persistence images.

In [None]:
pim = PersImage(pixels=[20,20], spread=1)
imgs = pim.transform(diagrams_h1)

Let's take a look at the persistence images. We should interpret each image as the pixelated plot of a function $\mathbb{R}^2 \rightarrow \mathbb{R}$, with colors corresponding to height of the plot. Such a plot can be understood as a vector: by reshaping the 20-pixel-by-20-pixel image into a list of 400 numbers, we can think of it as a vector in $\mathbb{R}^{400}$.

In [None]:
plt.figure(figsize=(15,7.5))

for i in range(4):
    ax = plt.subplot(240+i+1)
    pim.show(imgs[i], ax)
    plt.title("PI of $H_1$ for noise")

for i in range(4):
    ax = plt.subplot(240+i+5)
    pim.show(imgs[-(i+1)], ax)
    plt.title("PI of $H_1$ for circle w/ noise")

Roughly, these look like blurry plots of the persistence diagrams, with height corresponding roughly to density of points (technically, points farther from the $x$-axis are also given a higher "weight").
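
A stripped-down version of the construction can be written directly in NumPy. The sketch below moves each point $(b, d)$ to birth-lifetime coordinates $(b, d-b)$, places a Gaussian bump there, weights it by the lifetime, and samples the sum at pixel centers. It is a simplification under my own assumptions (linear weight, isotropic Gaussian, point sampling rather than integrating over pixels) and is not the code `PersImage` runs; `simple_persistence_image` and its parameters are hypothetical.

```python
import numpy as np

def simple_persistence_image(dgm, pixels=20, spread=1.0, max_val=10.0):
    """Weighted sum of Gaussians in birth-lifetime coordinates, sampled on a pixels x pixels grid."""
    dgm = np.asarray(dgm, dtype=float).reshape(-1, 2)
    births = dgm[:, 0]
    lifetimes = dgm[:, 1] - dgm[:, 0]
    # Pixel centers covering the square [0, max_val] x [0, max_val]
    centers = (np.arange(pixels) + 0.5) * max_val / pixels
    bx, ly = np.meshgrid(centers, centers)   # birth varies along columns, lifetime along rows
    img = np.zeros((pixels, pixels))
    for b, l in zip(births, lifetimes):
        bump = np.exp(-((bx - b) ** 2 + (ly - l) ** 2) / (2 * spread ** 2))
        img += l * bump                      # longer-lived points get more weight
    return img

# Made-up diagram: one prominent class and two short-lived ones
toy_dgm = np.array([[2.25, 9.5], [1.0, 1.5], [3.0, 3.4]])
toy_img = simple_persistence_image(toy_dgm)
print(toy_img.shape)   # (20, 20)
row, col = np.unravel_index(np.argmax(toy_img), toy_img.shape)
print(row, col)        # 14 4: the pixel centered at birth 2.25, lifetime 7.25
```

Flattening `toy_img` gives a vector in $\mathbb{R}^{400}$, which is the form used for classification below.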

### Classify the datasets from the persistence images

So far we have the following pipeline: collection of datasets, organized into two categories --> collection of persistence diagrams --> collection of vectorized representations of these PDs. Now we can use classical statistical techniques to attempt to classify our data.

We first flatten each persistence image so that it is really represented as a vector in $\mathbb{R}^{400}$.

In [None]:
imgs_array = np.array([img.flatten() for img in imgs])

Next we randomly divide our data into a training set and a testing set (with labels for each).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(imgs_array, labels, test_size=0.30, random_state=1)

Next we perform a logistic regression on the training data. Essentially, the goal of the logistic regression algorithm is to find a vector in $\mathbb{R}^{400}$ such that the label of each vector (persistence image) in the training set can be predicted by taking its dot product with the vector we find.

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'lbfgs')
lr.fit(X_train, y_train)
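
Concretely, a fitted binary logistic regression stores a weight vector and an intercept (`lr.coef_` and `lr.intercept_` in `scikit-learn`), and predicts label 1 when the sigmoid of the dot product plus intercept exceeds 1/2. A sketch with made-up weights, where `predict_from_weights` is a hypothetical helper:

```python
import numpy as np

def predict_from_weights(X, w, b):
    """Label 1 where sigmoid(X @ w + b) > 0.5, which is the same as X @ w + b > 0."""
    scores = X @ w + b
    probs = 1.0 / (1.0 + np.exp(-scores))
    return (probs > 0.5).astype(int), probs

# Made-up weights on a 3-dimensional feature space
toy_w = np.array([2.0, -1.0, 0.0])
toy_b = -0.5
toy_X = np.array([[1.0, 0.0, 3.0],    # score  1.5 -> label 1
                  [0.0, 1.0, 3.0]])   # score -1.5 -> label 0
toy_pred, toy_probs = predict_from_weights(toy_X, toy_w, toy_b)
print(toy_pred)   # [1 0]
```

The third feature has weight zero, so it has no influence on the prediction; the weight image plotted later can be read in the same way.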

Finally, we predict labels for the test data based off of the logistic regression. That is, we take the dot product with the vector we found by logistic regression and assign a label based on the result. We then check whether the predicted label agrees with the true label. We count the ratio of correct labels to get the LR score for the test data.

In [None]:
lr.score(X_test, y_test)

We see that our predictor performs (surprisingly, to me) extremely well!

### Visualizing the Weights

We can visualize the weight vector that was found by logistic regression. It is a vector in $\mathbb{R}^{400}$, but we can reshape it into a 20-pixel-by-20-pixel array, so that it matches the form of the persistence images. This should give an idea of which features are most important in classification.

In [None]:
inverse_image = lr.coef_.reshape((20,20))
plt.imshow(inverse_image)
plt.colorbar();

We see a spot in the lower left corner is a strong indicator of 'circle with noise'. The middle region, on the other hand, seems to correlate with 'just noise'.

### Sanity Check: A Naive Approach

To test whether this excellent performance comes from the efficacy of logistic regression itself, or whether the persistence image representation is actually doing the work, let's try to classify with a more naive approach.

Each sample in our data set is an npoints-by-2 array. We could flatten each of these to get a vector in $\mathbb{R}^{2 \cdot \mbox{npoints}}$, then run the same regression procedure to see if logistic regression can classify these vectors. 

(This is actually not so crazy! The way the noise+circle samples were constructed should give these $2\cdot \mbox{npoints}$-dimensional vectors a different structure than the pure noise samples.)

First we flatten the data, split into training and test sets and perform regression.

In [None]:
datas_array = np.array([sample.flatten() for sample in datas])

X_train, X_test, y_train, y_test = train_test_split(datas_array, labels, test_size=0.40, random_state=42)

lr = LogisticRegression(solver = 'lbfgs')
lr.fit(X_train, y_train)

Then we test the regression score.

In [None]:
lr.score(X_test, y_test)

We see that our regression does a little better than random guessing (at least if snr is not too low), but not nearly as well as the persistence images approach! 

Just for fun, let's take a look at the vector we are using to do the regression.

In [None]:
inverse_image = lr.coef_.reshape((npoints,2))

plt.scatter(inverse_image[:,0],inverse_image[:,1])
plt.axis('equal');

### Homework

1) Create new toy data where noisy images contain multiple circles, partial circles, blobs, etc. Do persistence images also do a good job on your new toy dataset?

2) Try something like this on real data; for example take the picture from https://en.wikipedia.org/wiki/Transmission_electron_cryomicroscopy, cut the picture into a grid, use the grid squares as data. Can you use these methods to determine whether a grid square contains a molecule image?